ABSTRACT



A method and apparatus for the automatic analysis, synthe- 
sis and modification of audio signals, based on an overlap- 
add sinusoidal model, is disclosed. Automatic analysis of 
amplitude, frequency and phase parameters of the model is 
achieved using an analysis-by-synthesis procedure which 
incorporates successive approximation, yielding synthetic 
waveforms which are very good approximations to the 
original waveforms and are perceptually identical to the 
original sounds. A generalized overlap-add sinusoidal model 
is introduced which can modify audio signals without objec- 
tionable artifacts. In addition, a new approach to pitch-scale 
modification allows for the use of arbitrary spectral envelope 
estimates and addresses the problems of high-frequency loss 
and noise amplification encountered with prior art methods. 
The overlap-add synthesis method provides the ability to 
synthesize sounds with computational efficiency rivaling 
that of synthesis using the discrete short-time Fourier trans- 
form (DSTFT) while eliminating the modification artifacts 
associated with that method. 
SPEECH APPROXIMATION USING
SUCCESSIVE SINUSOIDAL OVERLAP-ADD 
MODELS AND PITCH-SCALE 
MODIFICATIONS 


This application is a continuation-in-part of U.S. Ser. No.
748,544 filed Aug. 22, 1991, now U.S. Pat. No. 5,327,518,
entitled "AUDIO ANALYSIS/SYNTHESIS SYSTEM."

TECHNICAL FIELD

The present invention relates to methods and apparatus
for acoustic signal processing and especially for audio
analysis and synthesis. More particularly, the present
invention relates to the analysis and synthesis of audio signals
such as speech or music, whereby time-, frequency- and
pitch-scale modifications may be introduced without
perceptible distortion.

BACKGROUND OF THE INVENTION

For many years the most popular approach to representing
speech signals parametrically has been linear predictive (LP)
modeling. Linear prediction is described by J. Makhoul,
"Linear Prediction: A Tutorial Review," Proc. IEEE, vol. 63,
pp. 561-580, April 1975. In this approach, the speech
production process is modeled as a linear time-varying,
all-pole vocal tract filter driven by an excitation signal
representing characteristics of the glottal waveform. While
many variations on this basic model have been widely used
in low bit-rate speech coding, the formulation known as
pitch-excited LPC has been very popular for speech synthesis
and modification as well. In pitch-excited LPC, the
excitation signal is modeled either as a periodic pulse train
for voiced speech or as white noise for unvoiced speech. By
effectively separating and parameterizing the voicing state,
pitch frequency and articulation rate of speech, pitch-excited
LPC can flexibly modify analyzed speech as well as produce
artificial speech given linguistic production rules (referred to
as synthesis-by-rule).

However, pitch-excited LPC is inherently constrained and
suffers from well-known distortion characteristics. LP modeling
is based on the assumption that the vocal tract may be
modeled as an all-pole filter; deviations of an actual vocal
tract from this ideal thus result in an excitation signal
without the purely pulse-like or noisy structure assumed in
the excitation model. Pitch-excited LPC therefore produces
synthetic speech with noticeable and objectionable distortions.
Also, LP modeling assumes a priori that a given signal
is the output of a time-varying filter driven by an easily
represented excitation signal, which limits its usefulness to
those signals (such as speech) which are reasonably well
represented by this structure. Furthermore, pitch-excited
LPC typically requires a "voiced/unvoiced" classification
and a pitch estimate for voiced speech; serious distortions
result from errors in either procedure.

Time-frequency representations of speech combine the
observations that much speech information resides in the
frequency domain and that speech production is an inherently
non-stationary process. While many different types of
time-frequency representations exist, to date the most popular
for the purpose of speech processing has been the short-time
Fourier transform (STFT). One formulation of the STFT,
discussed in the article by J. L. Flanagan and R. M. Golden,
"Phase Vocoder," Bell Sys. Tech. J., vol. 45, pp. 1493-1509,
1966, and known as the digital phase vocoder (DPV),
parameterizes speech production information in a manner very
similar to LP modeling and is capable of performing speech
modifications without the constraints of pitch-excited LPC.

Unfortunately, the DPV is also computationally intensive, 
limiting its usefulness in real-time applications. An alternate 
approach to the problem of speech modification using the 
STFT is based on the discrete short-time Fourier transform 
(DSTFT), implemented using a Fast Fourier Transform 
(FFT) algorithm. This approach is described in the Ph.D.
thesis of M. R. Portnoff, Time-Scale Modification of Speech
Based on Short-Time Fourier Analysis, Massachusetts Insti- 
tute of Technology, 1978. While this approach is computa- 
tionally efficient and provides much of the functionality of 
the DPV, when applied to modifications the DSTFT gener- 
ates reverberant artifacts due to phase distortion. An iterative 
approach to phase estimation in the modified transform has 
been disclosed by D. W. Griffin and J. S. Lim in "Signal 
Estimation from Modified Short-Time Fourier Transform," 
IEEE Trans. on Acoust., Speech and Signal Processing, vol.
ASSP-32, no. 2, pp. 236-242, 1984. This estimation tech- 
nique reduces phase distortion, but adds greatly to the 
computation required for implementation. 

Sinusoidal modeling, which represents signals as sums of 
arbitrary amplitude- and frequency-modulated sinusoids, 
has recently been introduced as a high-quality alternative to 
LP modeling and the STFT and offers advantages over these 
approaches for synthesis and modification problems. As 
with the STFT, sinusoidal modeling operates without an 
"all-pole" constraint, resulting in more natural sounding 
synthetic and modified speech. Also, sinusoidal modeling 
does not require the restrictive "source/filter" structure of LP 
modeling; sinusoidal models are thus capable of represent- 
ing signals from a variety of sources, including speech from 
multiple speakers, music signals, speech in musical back- 
grounds, and certain biological and biomedical signals. In 
addition, sinusoidal models offer greater access to and 
control over speech production parameters than the STFT. 

The most notable and widely used formulation of sinu- 
soidal modeling is the Sine-Wave System introduced by 
McAulay and Quatieri, as described in their articles "Speech 
Analysis/Synthesis Based on a Sinusoidal Representation,"
IEEE Trans. on Acoust., Speech and Signal Processing, vol.
ASSP-34, pp. 744-754, August 1986, and "Speech
Transformations Based on a Sinusoidal Representation," IEEE
Trans. on Acoust., Speech and Signal Processing, vol.
ASSP-34, pp. 1449-1464, December 1986. The Sine-Wave
System has proven to be useful in a wide range of speech 
processing applications, and the analysis and synthesis tech- 
niques used in the system are well-justified and reasonable, 
given certain assumptions. 

Analysis in the Sine-Wave System derives model parameters
from peaks of the spectrum of a windowed signal
segment. The theoretical justification for this analysis tech- 
nique is based on an analogy to least-squares approximation 
of the segment by constant-amplitude, constant-frequency 
sinusoids. However, sinusoids of this form are not used to 
represent the analyzed signal; instead, synthesis is imple- 
mented with parameter tracks created by matching sinusoids 
from one frame to the next and interpolating the matched 
parameters using polynomial functions. 

This implementation, while making possible many of the 
applications of the system, represents an uncontrolled depar- 
ture from the theoretical basis of the analysis technique. This 
can lead to distortions, particularly during non-stationary 
portions of a signal. Furthermore, the matching and inter- 
polation algorithms add to the computational overhead of 
the system, and the continuously variable nature of the 
parameter tracks necessitates direct evaluation of the
sinusoidal components at each sample point, a significant
computational obstacle. A more computationally efficient
synthesis algorithm for the Sine-Wave System has been
proposed by McAulay and Quatieri in "Computationally
Efficient Sine-Wave Synthesis and its Application to
Sinusoidal Transform Coding," Proc. IEEE Int'l Conf. on
Acoust., Speech and Signal Processing, pp. 370-373, April
1988, but this algorithm departs even farther from the
theoretical basis of analysis.

Many techniques for the digital generation of musical 
sounds have been studied, and many are used in commer- 
cially available music synthesizers. In all of these techniques 
a basic tradeoff is encountered; namely, the conflict between
accuracy and generality (defined as the ability to model a
wide variety of sounds) on the one hand and computational
efficiency on the other. Some techniques, such as frequency
modulation (FM) synthesis as described by J. M. Chowning,
"The Synthesis of Complex Audio Spectra by Means of
Frequency Modulation," J. Audio Eng. Soc., vol. 21, pp.
526-534, September 1973, are computationally efficient and
can produce a wide variety of new sounds, but lack the 
ability to accurately model the sounds of existing musical 
instruments. 

On the other hand, sinusoidal additive synthesis implemented
using the DPV is capable of analyzing the sound of
a given instrument, synthesizing a perfect replica and
performing a wide variety of modifications. However, as
previously mentioned, the amount of computation needed to
calculate the large number of time-varying sinusoidal
components required prohibits real-time synthesis using
relatively inexpensive hardware. As in the case of
time-frequency speech modeling, the computational problems of
additive synthesis of musical tones may be addressed by
formulating the DPV in terms of the DSTFT and implementing
this formulation using FFT algorithms. Unfortunately,
this strategy produces the same type of distortion when
applied to musical tone synthesis as to speech synthesis.

There clearly exists a need for better methods and devices
for the analysis, synthesis and modification of audio
waveforms. In particular, an analysis/synthesis system capable
of altering the pitch frequency and articulation rate of speech
and music signals and capable of operating with low
computational requirements and therefore low hardware cost
would satisfy long-felt needs and would contribute
significantly to the art.



SUMMARY OF THE INVENTION 
The present invention addresses the above-described
limitations of the prior art and achieves a technical advance
by provision of a method and structural embodiment
comprising: an analyzer responsive to either speech or musical
tone signals which, for each of a plurality of overlapping data
frames, extracts and stores parameters which serve to
represent input signals in terms of an overlap-add,
quasi-harmonic sinusoidal model; and a synthesizer responsive to
the stored parameter set previously determined by analysis to
produce a synthetic facsimile of the analyzed signal or
alternately a synthetic audio signal advantageously modified
in time-, frequency- or pitch-scale.

In one embodiment of the present invention appropriate
for speech signals, the analyzer determines a time-varying
gain signal representative of time-varying energy changes in
the input signal. This time-varying gain is incorporated in
the synthesis model and acts to improve modeling accuracy
during transient portions of a signal. Also, given isolated
frames of input signal and time-varying gain signal data, the
analyzer determines sinusoidal model parameters using a
frequency-domain analysis-by-synthesis procedure implemented
using a Fast Fourier Transform (FFT) algorithm.
Advantageously, this analysis procedure overcomes inaccuracies
encountered with discrete Fourier transform "peak-picking"
analysis as used in the Sine-Wave System, while
maintaining a comparable computational load. Furthermore,
a novel fundamental frequency estimation algorithm is
employed which uses knowledge gained from analysis to
improve computational efficiency over prior art methods.

The synthesizer associated with this embodiment advan- 
tageously uses a refined modification model, which allows 
modified synthetic speech to be produced without the objec- 
tionable artifacts typically associated with modification 
using the DSTFT and other prior art methods. In addition, 
overlap-add synthesis may be implemented using an FFT 
algorithm, providing improved computational efficiency 
over prior art methods without departing significantly from 
the synthesis model used in analysis. 

The synthesizer also incorporates an improved phase
coherence preservation algorithm which provides higher
quality modified speech. Furthermore, the synthesizer
performs pitch-scale modification using a phasor interpolation
procedure. This procedure eliminates the problems of
information loss and noise migration often encountered in prior
art methods of pitch modification.

In an embodiment of the present invention appropriate for 
musical tone signals, a harmonically-constrained analysis- 
by-synthesis procedure is used to determine appropriate 
sinusoidal model parameters and a fundamental frequency 
estimate for each frame of signal data. This procedure allows 
for fine pitch tracking over the analyzed signal without 
significantly adding to the computational load of analysis. 
Due to a priori knowledge of pitch, the synthesizer associ- 
ated with this embodiment uses a simple functional con- 
straint to maintain phase coherence, significantly reducing 
the amount of computation required to perform modifica- 
tions. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 is a system level block diagram of a speech 
analyzer according to the present invention showing the 
required signal processing elements and their relationship to 
the flow of the information signals. 

FIG. 2 is a flowchart illustrating the information process- 
ing task which takes place in the time-varying calculator 
block of FIG. 1. 

FIG. 3 is an illustration of overlap-add synthesis, showing 
the relationship of windowed synthetic contributions and 
their addition to form a synthesis frame of s[n]. 

FIG. 4 is a functional block diagram illustrating the 
closed-loop analysis-by-synthesis procedure used in the 
invention. 

FIGS. 5 and 6 are flowcharts showing the information 
processing tasks achieved by the analysis-by-synthesis 
block of FIG. 1. 

FIGS. 7-9 are flowcharts showing the information pro- 
cessing tasks achieved by the fundamental frequency esti- 
mator block of FIG. 1. 

FIG. 10 is a flowchart showing the information processing 
tasks achieved by the harmonic assignment block of FIG. 1. 

FIG. 11 is a system level block diagram of a speech
analyzer according to the present invention similar in
operation to the speech analyzer of FIG. 1 but which operates
without incorporating time-varying gain sequence $\sigma[n]$.

FIG. 12 is a system level block diagram of a musical tone 
analyzer according to the present invention showing the 
required signal processing elements and their relationship to 
the flow of the information signals. 

FIGS. 13-15 are flowcharts showing the information 
processing tasks achieved by the harmonically-constrained 
analysis-by-synthesis block of FIG. 12. 

FIG. 16 is a system level block diagram of a musical tone
analyzer according to the present invention similar in
operation to the musical tone analyzer of FIG. 12 but which
operates without incorporating time-varying gain sequence
$\sigma[n]$.

FIG. 17 is a system level block diagram of a speech 
synthesizer according to the present invention, showing the 
required signal processing elements and their relationship to 
the flow of the information signals. 

FIGS. 18A and 18B are illustrations of distortion due to
extrapolation beyond analysis frame boundaries. The phase
coherence of $\tilde{s}^k[n]$ is seen to break down quickly outside the
analysis frame due to the quasi-harmonic nature of the
model.

FIGS. 19A and 19B are illustrations of the effect of
differential frequency scaling in the refined modification 
model. The phase coherence of the synthetic contribution 
breaks down more slowly due to "pulling in" the differential 
frequencies. 

FIGS. 20 and 21 are flowcharts showing the information 
processing tasks achieved by the pitch onset time-estimator 
block of FIG. 17.

FIGS. 22A and 22B are illustrations of virtual excitation 
sequences in both the unmodified and modified cases, and of
the coherence constraint imposed on the sequences at 
boundary C. 

FIGS. 23 and 24 are flowcharts showing the information 
processing tasks achieved by the speech synthesizer DFT 
assignment block of FIG. 17.

FIG. 25 is a system level block diagram of a speech 
synthesizer according to the present invention similar in 
operation to the speech synthesizer of FIG. 17 but which is 
capable of performing time- and pitch-scale modifications. 

FIGS. 26 and 27 are flowcharts showing the information
processing tasks achieved by the phasor interpolator block 
of FIG. 25. 

FIG. 28 is a system level block diagram of a musical tone 
synthesizer according to the present invention showing the 
required signal processing elements and their relationship to 
the flow of the information signals. 

FIG. 29 is a system level block diagram of a musical tone 
synthesizer according to the present invention similar in 
operation to the musical tone synthesizer of FIG. 28 but 
which is capable of performing time- and pitch-scale modi- 
fications. 

FIG. 30 is a system level block diagram showing the 
architecture of a microprocessor implementation of the 
audio synthesis system of the present invention. 
DETAILED DESCRIPTION

FIG. 1 illustrates an analyzer embodiment of the present
invention appropriate for the analysis of speech signals.
Speech analyzer 100 of FIG. 1 responds to an analog speech
signal, denoted by $s_c(t)$ and received via path 120, in order
to determine the parameters of a signal model representing
the input speech and to encode and store these parameters in
storage element 113 via path 129. Speech analyzer 100
digitizes and quantizes $s_c(t)$ using analog-to-digital (A/D)
converter 101, according to the relation

$$s[n] = Q\{s_c(n/F_s)\}, \tag{1}$$

where $F_s$ is the sampling frequency in samples/sec and $Q\{\cdot\}$
represents the quantization operator of A/D converter 101. It
is assumed that $s_c(t)$ is bandlimited to $F_s/2$ Hz.

Time-varying gain calculator 102 responds to the data
stream produced by A/D converter 101 to produce a
sequence $\sigma[n]$ which reflects time-varying changes in the
average magnitude of $s[n]$. This sequence may be determined
by applying a lowpass digital filter to $|s[n]|$. One such
filter is defined by the recursive relation

$$y_i[n] = \lambda\,y_i[n-1] + (1-\lambda)\,y_{i-1}[n], \quad 1 \le i \le I, \tag{2}$$

where $y_0[n] = |s[n]|$. The time-varying gain sequence is then
given by

$$\sigma[n] = y_I[n + n_\sigma], \tag{3}$$

where $n_\sigma$ is the delay in samples introduced by filtering. The
frequency response of this filter is given by

$$H(e^{j\omega}) = \left(\frac{1-\lambda}{1-\lambda e^{-j\omega}}\right)^{I}, \tag{4}$$

where the filter parameters $\lambda$ and $I$ determine the frequency
selectivity and rolloff of the filter, respectively. For speech
analysis, a fixed value of $I=20$ is appropriate, while $\lambda$ is
varied as a function of $F_s$ according to

(5)

assuring that the filter bandwidth is approximately independent
of the sampling frequency. The filter delay $n_\sigma$ can then
be determined as

(6)

where $\langle\cdot\rangle$ represents the "round to nearest integer" operator.
A flowchart of this algorithm is shown in FIG. 2. Time-varying
gain calculator 102 transmits $\sigma[n]$ via path 121 to
parameter encoder 112 for subsequent transmission to storage
element 113.
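The gain calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `time_varying_gain` and its arguments are hypothetical, the value of `lam` must be supplied by the caller, and the delay is taken here as the DC group delay of the filter cascade, which is an assumption of this sketch.

```python
import math

def time_varying_gain(s, lam, stages=20):
    """Cascade `stages` one-pole lowpass filters over |s[n]| and
    advance the result by an assumed filter delay."""
    y = [abs(v) for v in s]                      # y_0[n] = |s[n]|
    for _ in range(stages):
        out, prev = [], 0.0
        for v in y:
            # y_i[n] = lam * y_i[n-1] + (1 - lam) * y_{i-1}[n]
            prev = lam * prev + (1.0 - lam) * v
            out.append(prev)
        y = out
    # Assumption: delay = DC group delay of the cascade, rounded.
    delay = int(round(stages * lam / (1.0 - lam)))
    # sigma[n] = y_I[n + delay]; hold the last value past the end of the data.
    return [y[min(n + delay, len(y) - 1)] for n in range(len(s))]
```

For a signal of constant magnitude the output settles to that magnitude, which is the behavior expected of an envelope follower.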

It should be noted that any components of $s[n]$ with
frequencies close to $F_s/2$ will be "aliased" into low-frequency
components by the absolute value operator $|\cdot|$, which
can cause distortion in $\sigma[n]$. Therefore, it is advisable to
apply a lowpass filter to any $s[n]$ known to contain significant
high-frequency energy before taking the absolute value.
Such a filter need only attenuate frequencies near $F_s/2$, thus
it need not be complicated. One example is the simple filter
defined by

$$s_{lp}[n] = 0.25\,s[n-1] + 0.5\,s[n] + 0.25\,s[n+1]. \tag{7}$$

Consider now the operation of speech analyzer 100 in
greater detail. The signal model used in the invention to
represent $s[n]$ is an overlap-add sinusoidal model formulation
which produces an approximation to $s[n]$ given by

$$\tilde{s}[n] = \sigma[n] \sum_{k=-\infty}^{\infty} w_s[n-kN_s]\,\tilde{s}^k[n-kN_s], \tag{8}$$

where $\sigma[n]$ controls the time-varying intensity of $\tilde{s}[n]$, $w_s[n]$
is a complementary synthesis window which obeys the
constraint

$$\sum_{k=-\infty}^{\infty} w_s[n-kN_s] = 1, \tag{9}$$

and $\tilde{s}^k[n]$, the k-th synthetic contribution, is given by

$$\tilde{s}^k[n] = \sum_{j=1}^{J[k]} A_j^k \cos(\omega_j^k n + \phi_j^k), \tag{10}$$

where $\omega_j^k = 2\pi f_j^k/F_s$ and where $0 \le f_j^k \le F_s/2$. The "synthesis
frame length" $N_s$ typically corresponds to between 5 and 20
msec, depending on application requirements. While an
arbitrary complementary window function may be used for
$w_s[n]$, a symmetric, tapered window such as a Hanning
window of the form

$$w_s[n] = \begin{cases} \cos^2(n\pi/2N_s), & |n| \le N_s \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

is typically used. With this window, a synthesis frame of
$N_s$ samples of $\tilde{s}[n]$ may be written as

$$\tilde{s}[n+kN_s] = \sigma[n+kN_s]\left(w_s[n]\,\tilde{s}^k[n] + w_s[n-N_s]\,\tilde{s}^{k+1}[n-N_s]\right) \tag{12}$$

for $0 \le n < N_s$. FIG. 3 illustrates a synthesis frame and the
overlapping synthetic sequences which produce it.
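As a hedged illustration of overlap-add synthesis with a squared-cosine window, the sketch below builds one synthesis frame from two adjacent synthetic contributions. The function names and the representation of a contribution as a list of `(A, omega, phi)` triples are assumptions made for this example only.

```python
import math

def synthesis_window(n, Ns):
    """Squared-cosine (Hanning-type) window, nonzero for |n| <= Ns."""
    return math.cos(n * math.pi / (2 * Ns)) ** 2 if abs(n) <= Ns else 0.0

def contribution(n, params):
    """Sum of constant-parameter sinusoids; params holds (A, omega, phi) triples."""
    return sum(A * math.cos(w * n + p) for A, w, p in params)

def synthesis_frame(params_k, params_k1, gain, Ns):
    """Ns output samples formed from contribution k (centered at n = 0)
    and contribution k+1 (centered at n = Ns); gain[n] scales sample n."""
    return [gain[n] * (synthesis_window(n, Ns) * contribution(n, params_k)
                       + synthesis_window(n - Ns, Ns)
                       * contribution(n - Ns, params_k1))
            for n in range(Ns)]
```

Because the two window halves sum to one across the frame, two contributions that describe the same steady sinusoid reproduce it without amplitude ripple.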

Given $\sigma[n]$, the objective of analysis is to determine
amplitudes $\{A_j^k\}$, frequencies $\{\omega_j^k\}$ and phases $\{\phi_j^k\}$ for
each $\tilde{s}^k[n]$ in Equation 8 such that $\tilde{s}[n]$ is a "closest
approximation" to $s[n]$ in some sense. An approach typically
employed to solve problems of this type is to minimize the
mean-square error

$$E = \sum_{n=-\infty}^{\infty} \{s[n] - \tilde{s}[n]\}^2 \tag{13}$$

in terms of the parameters of $\tilde{s}[n]$. However, attempting to
solve this problem simultaneously for all the parameters
may not be practical.

Fortunately, if $s[n]$ is approximately stationary over short
time intervals, it is feasible to solve for the amplitude,
frequency and phase parameters of $\tilde{s}^k[n]$ in isolation by
approximating $s[n]$ over an analysis frame of length $2N_a+1$
samples centered at $n=kN_s$. The overlapping frames of
speech data and the accompanying frames of envelope data
required for analysis are isolated from $s[n]$ and $\sigma[n]$
respectively using frame segmenter blocks 103. The synthetic
contribution $\tilde{s}^k[n]$ may then be determined by minimizing

$$E^k = \sum_{n=-N_a}^{N_a} w_a[n]\,\{s[n+kN_s] - \sigma[n+kN_s]\,\tilde{s}^k[n]\}^2 \tag{14}$$

with respect to the amplitudes, frequencies and phases
of $\tilde{s}^k[n]$.

The analysis window $w_a[n]$ may be an arbitrary positive
function, but is typically a symmetric, tapered window
which serves to force greater accuracy at the frame center,
where the contribution of $\tilde{s}^k[n]$ to $\tilde{s}[n]$ is dominant. One
example is the Hamming window, given by

$$w_a[n] = \begin{cases} 0.54 + 0.46\cos(n\pi/N_a), & |n| \le N_a \\ 0, & \text{otherwise.} \end{cases} \tag{15}$$

The analysis frame length may be a fixed quantity, but it is
desirable in certain applications to have this parameter adapt
to the expected pitch of a given speaker. For example, as
discussed in U.S. Pat. No. 4,885,790, issued to R. J.
McAulay et al., the analysis frame length may be set to 2.5
times the expected average pitch period of the speaker to
provide adequate frequency resolution. In order to ensure the
accuracy of $\tilde{s}[n]$, it is necessary that $N_a \ge N_s$.

Defining $x[n]$ and $g[n]$ by

$$x[n] = (w_a[n])^{1/2}\,s[n+kN_s], \qquad g[n] = (w_a[n])^{1/2}\,\sigma[n+kN_s] \tag{16}$$

and making use of Equation 10, $E^k$ may be rewritten as

$$E^k = \sum_{n=-N_a}^{N_a} \Big\{ x[n] - g[n] \sum_{j=1}^{J} A_j \cos(\omega_j n + \phi_j) \Big\}^2 \tag{17}$$

where frame notation has been omitted to simplify the
equations. Unfortunately, without a priori knowledge of the
frequency parameters, this minimization problem is highly
nonlinear and therefore very difficult to solve.

As an alternative, a slightly suboptimal but relatively
efficient analysis-by-synthesis algorithm may be employed
to determine the parameters of each sinusoid successively.
This algorithm operates as follows: Suppose the parameters
of $l-1$ sinusoids have been determined previously, yielding
the successive approximation to $x[n]$,

$$\tilde{x}_{l-1}[n] = g[n] \sum_{j=1}^{l-1} A_j \cos(\omega_j n + \phi_j), \tag{18}$$

and the successive error sequence

$$e_{l-1}[n] = x[n] - \tilde{x}_{l-1}[n]. \tag{19}$$


Given the initial conditions $\tilde{x}_0[n]=0$ and $e_0[n]=x[n]$, these
sequences may be updated recursively by

$$\begin{aligned}
\tilde{x}_l[n] &= \tilde{x}_{l-1}[n] + g[n]\,A_l\cos(\omega_l n + \phi_l) \\
e_l[n] &= e_{l-1}[n] - g[n]\,A_l\cos(\omega_l n + \phi_l),
\end{aligned} \tag{20}$$

for $l \ge 1$. The goal is then to minimize the squared successive
error norm $E_l$, given by

$$E_l = \sum_{n=-N_a}^{N_a} \{ e_{l-1}[n] - g[n]\,A_l\cos(\omega_l n + \phi_l) \}^2 \tag{21}$$

in terms of $A_l$, $\omega_l$ and $\phi_l$.

At this point it is still not feasible to solve simultaneously
for the parameters due to the embedded frequency and phase
terms. However, assuming for the moment that $\omega_l$ is fixed
and recalling the trigonometric identity
$\cos(\alpha+\beta) = \cos\alpha\cos\beta - \sin\alpha\sin\beta$,
the expression for $E_l$ becomes

$$E_l = \sum_{n=-N_a}^{N_a} \{ e_{l-1}[n] - a_l\,g[n]\cos\omega_l n - b_l\,g[n]\sin\omega_l n \}^2, \tag{22}$$

where $a_l = A_l\cos\phi_l$ and $b_l = -A_l\sin\phi_l$. In this case the
problem is clearly in the form of a linear least-squares
approximation which when optimized in terms of $a_l$ and $b_l$
yields "normal equations" of the form

$$\begin{aligned}
a_l\gamma_{11} + b_l\gamma_{12} &= \psi_1 \\
a_l\gamma_{12} + b_l\gamma_{22} &= \psi_2,
\end{aligned} \tag{23}$$
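where

$$\begin{aligned}
\gamma_{11} &= \sum_{n=-N_a}^{N_a} g^2[n]\cos^2\omega_l n \\
\gamma_{12} &= \sum_{n=-N_a}^{N_a} g^2[n]\cos\omega_l n\,\sin\omega_l n \\
\gamma_{22} &= \sum_{n=-N_a}^{N_a} g^2[n]\sin^2\omega_l n \\
\psi_1 &= \sum_{n=-N_a}^{N_a} e_{l-1}[n]\,g[n]\cos\omega_l n \\
\psi_2 &= \sum_{n=-N_a}^{N_a} e_{l-1}[n]\,g[n]\sin\omega_l n.
\end{aligned} \tag{24}$$

Solving for $a_l$ and $b_l$ gives

$$\begin{aligned}
a_l &= (\gamma_{22}\psi_1 - \gamma_{12}\psi_2)/\Delta \\
b_l &= (\gamma_{11}\psi_2 - \gamma_{12}\psi_1)/\Delta,
\end{aligned} \tag{25}$$

where $\Delta = \gamma_{11}\gamma_{22} - \gamma_{12}^2$. By the Principle of Orthogonality,
given $a_l$ and $b_l$, $E_l$ can be expressed as

$$E_l = \sum_{n=-N_a}^{N_a} e_{l-1}^2[n] - a_l\psi_1 - b_l\psi_2. \tag{26}$$

Having determined $a_l$ and $b_l$, $A_l$ and $\phi_l$ are then given by the
relations

$$A_l = \sqrt{a_l^2 + b_l^2}, \qquad \phi_l = -\tan^{-1}(b_l/a_l). \tag{27}$$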



This establishes a method for determining the optimal
amplitude and phase parameters for a single sinusoidal
component of $\tilde{s}^k[n]$ at a given frequency. To determine an
appropriate frequency for this sinusoid, an ensemble search
procedure may be employed. While a variety of search
strategies are possible, the most straightforward is an
"exhaustive search." In this procedure, $\omega_l$ is varied over a set
of uniformly spaced candidate frequencies given by
$\omega_c[i] = 2\pi i/M$ for $0 \le i \le M/2$ (assuming that M is an even number).
For each $\omega_c[i]$, the corresponding value of $E_l$ is calculated
using Equation 26, and $\omega_l$ is chosen as that value of $\omega_c[i]$
which yields the minimum error. $A_l$ and $\phi_l$ are then chosen
as the amplitude and phase parameters associated with that
frequency value.
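The single-component fit and the exhaustive frequency search can be sketched directly in the time domain. This is an illustrative sketch only: the function names are hypothetical, sequences are stored as Python lists with sample n at position n + Na, and the guard for a vanishing determinant (which occurs at frequencies 0 and pi, where the sine term is zero) is an addition of this example. Minimizing the error is done here by maximizing the error reduction a*psi1 + b*psi2, which follows from the orthogonality of the least-squares residual.

```python
import math

def fit_component(e, g, Na, omega):
    """Least-squares amplitude and phase of one sinusoid at frequency omega.
    Returns (A, phi, reduction), where reduction is the drop in squared error."""
    g11 = g12 = g22 = p1 = p2 = 0.0
    for n in range(-Na, Na + 1):
        c, s = math.cos(omega * n), math.sin(omega * n)
        gn, en = g[n + Na], e[n + Na]
        g11 += gn * gn * c * c
        g12 += gn * gn * c * s
        g22 += gn * gn * s * s
        p1 += en * gn * c
        p2 += en * gn * s
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        # Degenerate case (omega = 0 or pi): only the cosine term is usable.
        a, b = (p1 / g11 if g11 > 0.0 else 0.0), 0.0
    else:
        a = (g22 * p1 - g12 * p2) / det
        b = (g11 * p2 - g12 * p1) / det
    # With a = A*cos(phi) and b = -A*sin(phi):
    return math.hypot(a, b), math.atan2(-b, a), a * p1 + b * p2

def exhaustive_search(e, g, Na, M):
    """Evaluate omega = 2*pi*i/M for 0 <= i <= M/2; keep the best candidate."""
    best = None
    for i in range(M // 2 + 1):
        omega = 2.0 * math.pi * i / M
        A, phi, red = fit_component(e, g, Na, omega)
        if best is None or red > best[3]:
            best = (omega, A, phi, red)
    return best
```

When the input is a single sinusoid whose frequency lies on the candidate grid, the search recovers its frequency, amplitude and phase. This direct form costs on the order of M times N_a operations per component, which is the cost the frequency-domain formulation is meant to reduce.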

In order to guarantee that $\tilde{x}_l[n]$ converges to $x[n]$, it is
necessary that $M > 2N_a$; furthermore, in order to guarantee a
level of accuracy which is independent of the analysis frame
length, M should be proportional to $N_a$, i.e.

$$M = \nu N_a,$$

where $\nu$ is typically greater than six. Finally, to facilitate
computation it is often desirable to restrict M to be an integer
power of two. For example, given the above conditions a
suitable value of M for the case when $N_a=80$ would be
M=512.

Having determined the parameters of the l-th component,
the successive approximation and error sequences are
updated by Equation 20, and the procedure is repeated for
the next component. The number of components, J[k], may
be fixed or may be determined in the analysis procedure
according to various "closeness of fit" criteria well known in
the art. FIG. 4 shows a functional block diagram of the
analysis procedure just described, illustrating its iterative,
"closed-loop" structure.

Due to a natural high-frequency attenuation in the vocal
tract referred to as "spectral tilt," speech signals often have
energy concentrated in the low-frequency range. This
phenomenon, combined with the tendency of analysis-by-synthesis
to select components in order of decreasing amplitude
and with the fact that slight mismatches exist between
speech signals and their sinusoidal representations, implies
that analysis-by-synthesis tends to first choose high-amplitude
components at low frequencies, then smaller sinusoids
immediately adjacent in frequency to the more significant
components. This "clustering" behavior slows the analysis
algorithm by making more iterations necessary to capture
perceptually important high-frequency information in
speech. Furthermore, low-amplitude components clustered
about high-amplitude components are perceptually irrelevant,
since they are "masked" by the larger sinusoids. As a
result, expending extra analysis effort to determine them is
wasteful.

Two approaches have been considered for dealing with
the effects of clustering. First, since clustering arises
primarily because high-frequency components in speech
have small amplitudes relative to low-frequency components,
one solution is to apply a high-pass filter to $s[n]$ before
analysis to make high-frequency components comparable in
amplitude to low-frequency components. In order to be
effective, the high-pass filter should approximately achieve
a 6 dB/octave gain, although this is not critical. One simple
filter which works well is defined by

$$s_{pf}[n] = s[n] - 0.9\,s[n-1]. \tag{28}$$

Since in this approach the "prefiltered" signal $s_{pf}[n]$ is
modeled instead of $s[n]$, the effects of prefiltering must be
removed before producing synthetic speech. This may be
done either by applying the inverse of the filter given by
Equation 28 to $\tilde{s}[n]$, or by removing the effects from the
model parameters directly, using the formulas

(29)

A second approach to the problem of clustering is based
on the observation that low-amplitude sinusoids tend to
cluster around a high-amplitude sinusoid only in the
frequency range corresponding to the main lobe bandwidth of
$W_a(e^{j\omega})$, the frequency spectrum of $w_a[n]$. Thus, given a
component with frequency $\omega_l$ determined by analysis-by-synthesis,
it may be assumed that no perceptually important
components lie in the frequency range



(30) 



where $B_{ml}$ is the main lobe bandwidth of $W_a(e^{j\omega})$. The
frequency domain characteristics of a number of tapered
windows are discussed by A. V. Oppenheim and R. W.
Schafer in Discrete-Time Signal Processing, Englewood
Cliffs, N.J.: Prentice-Hall, 1989, pp. 447-449. Therefore, the
proposed analysis-by-synthesis algorithm may be modified
such that once a component with frequency $\omega_l$ has been
determined, frequencies in the range given by Equation 30
are eliminated from the ensemble search thereafter, which
ensures that clustering will not occur.
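One way to picture the exclusion step is as a filter on the list of candidate indices. The sketch below is illustrative only: the function name is hypothetical, and the excluded range is expressed through a caller-supplied `half_width` in radians per sample, since the exact relation between that range and the main lobe bandwidth is not restated here.

```python
import math

def allowed_candidates(M, chosen, half_width):
    """Indices i (0 <= i <= M/2) whose frequency 2*pi*i/M lies farther than
    half_width (rad/sample) from every previously chosen frequency."""
    keep = []
    for i in range(M // 2 + 1):
        omega = 2.0 * math.pi * i / M
        if all(abs(omega - w) > half_width for w in chosen):
            keep.append(i)
    return keep
```

Restricting the ensemble search to the returned indices prevents later iterations from selecting frequencies adjacent to a component that has already been found.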

The amount of computation required to perform
analysis-by-synthesis is reduced greatly by recognizing that many of
the required calculations may be performed using a Fast
Fourier Transform (FFT) algorithm. The M-point discrete
Fourier transform (DFT) of an M-point sequence $x[n]$ is
defined by

$$X[m] = \sum_{n=0}^{M-1} x[n]\,W_M^{mn}, \quad 0 \le m < M, \tag{31}$$

where

$$W_M^{mn} = e^{-j(2\pi/M)mn}. \tag{32}$$

When $x[n]$ is a real-valued sequence the following identities
hold:

$$\begin{aligned}
\sum_{n=0}^{M-1} x[n]\cos((2\pi/M)mn) &= \Re e\{X[m]\} \\
\sum_{n=0}^{M-1} x[n]\sin((2\pi/M)mn) &= -\Im m\{X[m]\}.
\end{aligned} \tag{33}$$

For the purposes of analysis-by-synthesis the M-point
DFTs of $e_{l-1}[n]g[n]$ and $g^2[n]$ are written as

$$\begin{aligned}
EG_{l-1}[m] &= \sum_{n=-N_a}^{N_a} e_{l-1}[n]\,g[n]\,W_M^{mn} \\
GG[m] &= \sum_{n=-N_a}^{N_a} g^2[n]\,W_M^{mn}.
\end{aligned} \tag{34}$$

Noting that $W_M^{m(n+M)} = W_M^{mn}$, these DFTs may be cast in
the form of Equation 31 (provided that $M > 2N_a$) by adding
M to the negative summation index values and zero-padding
the unused index values.

Consider now the inner product expressions which must
be calculated in the analysis-by-synthesis algorithm. From
Equation 24, for the case of $\omega_l = \omega_c[i] = 2\pi i/M$, $\gamma_{11}$ is given by

reduce the amount of computation further, the identities
described above may be used to update this DFT sequence.

According to Equation 20, the updated error sequence
after the l-th component, $e_l[n]$, is given by

$$e_l[n] = e_{l-1}[n] - A_l\,g[n]\cos(\omega_l n + \phi_l). \tag{40}$$

From this it is clear that the updated DFT $EG_l[m]$ is then

$$EG_l[m] = EG_{l-1}[m] - A_l \sum_{n=-N_a}^{N_a} g^2[n]\cos(\omega_l n + \phi_l)\,W_M^{mn}. \tag{41}$$

Recalling that $\omega_l = 2\pi i_l/M$, this becomes

25 
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£G i tm]=EG / _ 1 [m]^^* , Ca((m^/)) Jtf H^,c" > *'GGf((m+i / )) i/ ], 

(42) 

where ((•))*/ denotes the "modulo M" operator. EG,[m] can 
therefore be expressed as a simple linear combination of 
EG^m] and circularly shifted versions of GG[m]; this 
establishes a fast method of analysis-by-synthesis which 
operates in the frequency domain. A flowchart of this 
algorithm is given in FIGS. 5 and 6. 

It should be apparent to those skilled in the art that there 
are occasions when EGJm] will be a useful quantity in and 
of itself. For instance, if the goal of analyzing a signal made 
up of sinusoidal components plus noise is to determine the 
Fourier transform of the noise term, then EG z [m] corre- 
sponds to this quantity after removing the sinusoidal signal 
components. 

Recalling that e 0 [n]=oc[n] t then according to Equation 34, 



35 



EG Q [m] = XG[m] = X x\ n ] 8 [n}W^. 
n--N a 



(43) 



No 
rv=-N a 



Substituting the definitions of x[n] and g[n] from Equation 
(35) 16, XG[m] and GG[m] may be written as 



Using Equation 33 and recalling that cos 2 9=Vfc+V&cos26, this ^ 
becomes 



Yn^GCfO]-^ ft e{GG[2i)}. 



(36) 



Similarly, expressions for y 12 and y 22 can also be derived: 
y 12 =-V4$m{GG[2i]} 

y^ViGGlOYM « «{GG[2i]}. (37) 



XG[m] = Z <j} a [n)s[n + kNMn+kNs}W!it 
n=*-N a 



GG\m] = X G^ ) [n]CT 2 [n + JW / 3H^^ , ■, 
n=-N a 



(44) 



45 



The first three parameters may therefore be determined from 
the stored values of a single DFT which need only be 
calculated once per analysis frame using an FFT algorithm, so 
provided that M is a highly composite number Furthermore, 
if M is an integer power of 2, then the particularly efficient 
"radix-2" FFT algorithm may be used. A variety of FFT 
algorithms are described by A. V. Oppenheim and R. W. 
Schafer in Discrete-Time Signal Processing, Englewood 55 
Cliffs, N.J.: Prentice-Hall, 1989. 

Similar expressions for \|f j and \y 2 can be derived direcdy 
from the DFT identities given above: 



and 



Vi^eiEG^m 



y^SmlEG^Wh 



(38) 



60 



(39) 



These parameters may thus be expressed in terms of the 
stored values of EG^Jm]. However, since e^fn] changes 65 
for each new component added to the approximation, EG^ 
i[m] must be computed J[k] times per frame. In order to 



that is, XG[m] and GG[m], the two functions required for 
fast analysis-by-synthesis, are the zero-padded M-point 
DFT's of the sequences x[n]g[n] and g 2 [n], respectively, 
This first sequence is the product of the speech data frame 
and the envelope data frame multiplied by the analysis 
window function w fl [n]; likewise, g 2 [n) is simply the square 
of the envelope data frame multiplied by w a [n]. 
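As a numerical illustration of Equations 34 through 39, the sketch below builds the zero-padded, index-wrapped DFT and reads the five inner products off the stored transforms. It is a minimal sketch and not the patented implementation: the function names `wrapped_dft` and `inner_products` are hypothetical, the basis functions are assumed to be $\cos\omega n$ and $\sin\omega n$, and a direct O(M^2) DFT stands in for the FFT so that only the standard library is needed.

```python
import cmath
import math


def wrapped_dft(seq, n_a, m_len):
    """M-point DFT of seq[n], n = -n_a..n_a, stored as seq[n + n_a].

    Negative indices are moved to n + M and unused slots stay zero,
    so the sum matches Equation 31 applied to the padded buffer.
    """
    if m_len <= 2 * n_a:
        raise ValueError("M must exceed 2 * N_a")
    buf = [0.0] * m_len
    for n in range(-n_a, n_a + 1):
        buf[n % m_len] = seq[n + n_a]
    return [
        sum(buf[n] * cmath.exp(-2j * math.pi * m * n / m_len)
            for n in range(m_len))
        for m in range(m_len)
    ]


def inner_products(eg, gg, i):
    """Inner products for candidate bin i from stored DFTs EG and GG."""
    m_len = len(gg)
    gg2i = gg[(2 * i) % m_len]
    gamma11 = 0.5 * gg[0].real + 0.5 * gg2i.real
    gamma12 = -0.5 * gg2i.imag
    gamma22 = 0.5 * gg[0].real - 0.5 * gg2i.real
    psi1 = eg[i].real
    psi2 = -eg[i].imag
    return gamma11, gamma12, gamma22, psi1, psi2
```

In practice the direct sum would be replaced by an FFT of the padded buffer; the wrapping of negative indices is the only step specific to this formulation.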

Referring to FIG. 1, multiplier block 104 responds to a 
frame of speech data received via path 123 and a frame of 
envelope data received via path 122 to produce the product 
of the data frames. Analysis window block 106 multiplies 
the output of multiplier block 104 by the analysis window 
function $w_a[n]$, producing the sequence x[n]g[n] described
above. Squarer block 105 responds to a frame of envelope
data to produce the square of the data frame; the resulting
output is input to a second analysis window block to produce
the sequence $g^2[n]$. At this point x[n]g[n] and $g^2[n]$ are input
to parallel Fast Fourier Transform blocks 107, which yield
the M-point DFT's XG[m] and GG[m], respectively.
Analysis-by-synthesis block 108 responds to the input DFT's
XG[m] and GG[m] to produce sinusoidal model parameters
which approximate the speech data frame, using the fast
analysis-by-synthesis algorithm discussed above. The resulting
parameters are the amplitudes $\{A_l^k\}$, frequencies $\{\omega_l^k\}$
and phases $\{\phi_l^k\}$ which produce $s^k[n]$, as shown in
Equation 10.
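The frequency-domain update that makes the fast algorithm possible (Equation 42) can be checked with a short sketch. The names `dft` and `update_eg` are hypothetical, the DFT sign convention $W_M = e^{-j2\pi/M}$ is assumed, and the sequences are taken to be already wrapped into the index range 0 to M-1.

```python
import cmath
import math


def dft(x):
    """Direct M-point DFT with W_M = exp(-j 2 pi / M)."""
    m_len = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * m * n / m_len)
            for n in range(m_len))
        for m in range(m_len)
    ]


def update_eg(eg_prev, gg, amp, idx, phase):
    """Update EG after removing A cos(2 pi idx n / M + phase) g[n].

    Only two circular shifts of GG are needed, so no new transform
    of the time-domain error sequence has to be computed.
    """
    m_len = len(gg)
    rot = cmath.exp(1j * phase)
    return [
        eg_prev[m] - 0.5 * amp * (rot * gg[(m - idx) % m_len]
                                  + rot.conjugate() * gg[(m + idx) % m_len])
        for m in range(m_len)
    ]
```

The update costs O(M) per component, compared with O(M log M) for recomputing the transform of the new error sequence.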

System estimator 110 responds to a frame of speech data
transmitted via path 123 to produce coefficients representative
of $H(e^{j\omega})$, an estimate of the frequency response of the
human vocal tract. Algorithms to determine these coefficients
include linear predictive analysis, as discussed in U.S.
Pat. No. 3,740,476, issued to B. S. Atal, and homomorphic
analysis, as discussed in U.S. Pat. No. 4,885,790, issued to
R. J. McAulay et al. System estimator 110 then transmits
said coefficients via path 124 to parameter encoder 112 for
subsequent transmission to storage element 113.

In order to perform speech modifications using a sinusoi- 
dal model it is necessary for the frequency parameters 
associated with a given speech data frame to reflect the pitch 
information embedded in the frame. To this end, $s^k[n]$ may be
written in quasi-harmonic form:

$$s^k[n] = \sum_{j=0}^{J[k]} A_j^k \cos(\omega_j^k n + \phi_j^k), \tag{45}$$

where $\omega_j^k = j\omega_0^k + \Delta_j^k$, and where J[k] is now the
greatest integer such that $J[k]\,\omega_0^k \le \pi$. Note that only one
component is associated with each harmonic number j. With this
formulation, the fundamental frequency $\omega_0^k = 2\pi f_0^k/F_s$ must
now be determined.

Fundamental frequency estimator 109 responds to the 
analyzed model parameter set from analysis-by-synthesis 
block 108 and to vocal tract frequency response coefficients 
received via path 124 to produce an estimate of the fundamental
frequency $\omega_0^k$ of Equation 45. While many
approaches to fundamental frequency estimation may be
employed, a novel algorithm which makes use of the analyzed
sinusoidal model parameters in a fashion similar to the
algorithm disclosed by McAulay and Quatieri in "Pitch
Estimation and Voicing Detection Based on a Sinusoidal
Speech Model," Proc. IEEE Int'l Conf. on Acoust., Speech
and Signal Processing, pp. 249-252, April 1990, is
described here: If $\omega_0^k$ is defined as that value of $\omega$ which
minimizes the error induced by quantizing the frequency
parameters to harmonic values,

$$E(\omega) = \sum_{n=-N_s}^{N_s} \left\{ \sum_{j=0}^{J[k]} A_j^k
\left[ \cos(\omega_j^k n + \phi_j^k) - \cos(j\omega n + \phi_j^k) \right]
\right\}^2, \tag{46}$$

then $\omega_0^k$ is approximately equal to

$$\omega_0^k \approx \frac{\sum_{j=0}^{J[k]} j\,(A_j^k)^2\, \omega_j^k}
{\sum_{j=0}^{J[k]} j^2 (A_j^k)^2}, \tag{47}$$

assuming that $N_s$ is on the order of a pitch period or larger.
This estimate is simply the average of $\{\omega_j^k/j\}$ weighted by
$(jA_j^k)^2$.
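The weighted average described above can be written in a few lines. This is a sketch only; the function name `fundamental_estimate` and the convention that list position j holds the component assigned to harmonic number j are assumptions made for illustration.

```python
def fundamental_estimate(amps, freqs):
    """Average of freqs[j] / j weighted by (j * amps[j]) ** 2.

    amps[j] and freqs[j] belong to harmonic number j = 0..J; the j = 0
    term carries zero weight, so it never affects the estimate.
    """
    num = sum(j * a * a * w for j, (a, w) in enumerate(zip(amps, freqs)))
    den = sum((j * a) ** 2 for j, a in enumerate(amps))
    if den == 0.0:
        raise ValueError("no component with nonzero harmonic weight")
    return num / den
```

Because the weights grow with both amplitude and harmonic number, strong high harmonics dominate the estimate, which is what makes small frequency errors at low harmonics relatively harmless.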

Again suppressing frame notation, given an initial fundamental
frequency estimate $\omega'_0 = 2\pi f'_0/F_s$, it is possible to
arrange a subset of the analyzed sinusoidal model parameters
in the quasi-harmonic form of Equation 45 and to
update the fundamental frequency estimate recursively. This
is accomplished by passing through the frequency parameters
in order of decreasing amplitude and calculating each
frequency's harmonic number, defined as $\langle \omega_j/\omega_0 \rangle$.
If this equals the harmonic number of any previous component, the
component is assigned to the set of parameters $\mathcal{E}$ which are
excluded from the quasi-harmonic representation; otherwise,
the component is included in the quasi-harmonic set,
and its parameters are used to update $\omega_0$ according to
Equation 47. Any harmonic numbers left unassigned are
associated with zero-amplitude sinusoids at appropriate
multiples of the final value of $\omega_0$.
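The recursive pass described above can be sketched as follows. The function name `assign_harmonics` is hypothetical, the angle-bracket operator is assumed to mean rounding to the nearest integer, and the running sums implement the weighted average of Equation 47 over the components accepted so far.

```python
import math


def assign_harmonics(amps, freqs, w0_init):
    """Assign components to harmonic numbers in order of decreasing amplitude.

    Returns the refined fundamental, a dict mapping harmonic number to
    component index, and the list of excluded component indices.
    """
    order = sorted(range(len(amps)), key=lambda i: -amps[i])
    assigned = {}
    excluded = []
    w0 = w0_init
    num = 0.0
    den = 0.0
    for i in order:
        # Nearest-integer harmonic number (assumed meaning of <.>).
        j = int(math.floor(freqs[i] / w0 + 0.5))
        if j in assigned:
            excluded.append(i)
            continue
        assigned[j] = i
        if j > 0:
            # Running weighted average of freqs / j, weights (j * A) ** 2.
            num += j * amps[i] ** 2 * freqs[i]
            den += (j * amps[i]) ** 2
            w0 = num / den
    return w0, assigned, excluded
```

Processing the strongest components first means the estimate is anchored by the most reliable frequencies before weaker, possibly spurious, components are considered.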

In the case of speech signals, the above algorithm must be 
refined, since a reliable initial estimate is usually not avail- 
able. The following procedure is used to define and choose 
from a set of candidate fundamental frequency estimates: 
Since, in conditions of low-energy, wideband interference,
high-amplitude components correspond to signal components,
it may be assumed that the frequency f of the highest
amplitude component whose frequency is in the range from
100 to 1000 Hz is approximately some multiple of the actual
pitch frequency, i.e. $f_0 \approx f/i$ for some i.

In order to determine an appropriate value of i, a set of
values of i are determined such that f/i falls in the range from
40 to 400 Hz, the typical pitch frequency range for human
speech. For each i in this set the recursive fundamental
frequency estimation algorithm is performed as described
above, using an initial frequency estimate of
$\omega'_0[i] = 2\pi f_0[i]/F_s$, where $f_0[i] = f/i$. Given the resulting
refined estimate, a measure of the error power induced over the
speech data frame by fixing the quasi-harmonic frequencies to harmonic
values may be derived, yielding



(48) 



Due to the inherent ambiguity of fundamental frequency 
estimates, a second error measure is necessary to accurately 
resolve which candidate is most appropriate. This second 
quantity is a measure of the error power induced by inde- 
pendently organizing the parameters in quasi-harmonic form 
and quantizing the amplitude parameters to an optimal 
constant multiple of the vocal tract spectral magnitude at the 
component frequencies, given by 



Hi"- 



2> ) 



(49) 



where P tf is the power of the parameter set £ excluded from 
the quasi-harmonic representation, 



p -JL 
Fe ~ 2 



Z A/, 



45 



and where 



= Iaw^U 
(=0 



50 



1=0 



V _ 



(50) 



(51) 



(52) 



At this point a composite error function $P_T[i]$ is constructed
as the sum of the two error measures given above, and the refined
estimate $\omega_0[i]$ corresponding to the minimum value of $P_T[i]$
is chosen as the final estimate $\omega_0$. This algorithm is illustrated
in flowchart form by FIGS. 7 and 9. In the case where interference is
sufficiently strong or narrowband that the analyzed component
at frequency f cannot be assumed to be a signal
component, then the algorithm described above may still be
employed, using a predefined set of candidate frequencies
which are independent of the analyzed parameters. Fundamental
frequency estimator 109 then transmits $\omega_0$ via path
125 to parameter encoder 112 for subsequent transmission to
storage element 113.
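The candidate-generation step of the procedure above (dividing the frequency of the strongest component by successive integers and keeping the quotients that fall in the 40 to 400 Hz pitch range) can be sketched as follows. The function name and default arguments are hypothetical; the two error measures used to rank the candidates are not reproduced here.

```python
def candidate_fundamentals(f_peak, f_min=40.0, f_max=400.0):
    """Return (i, f_peak / i) for every integer i with f_min <= f_peak / i <= f_max."""
    candidates = []
    i = 1
    while f_peak / i >= f_min:
        if f_peak / i <= f_max:
            candidates.append((i, f_peak / i))
        i += 1
    return candidates
```

For a strongest component at 450 Hz, for example, the divisors 2 through 11 give candidates between 225 Hz and about 40.9 Hz; each would then seed the recursive estimator.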

Harmonic assignment block 111 responds to the fundamental
frequency estimate $\omega_0$ and the model parameters
determined by analysis-by-synthesis to produce a quasi-harmonic
parameter set as in Equation 45. This is accomplished
by assigning each successive component a harmonic
number given by $\langle \omega_j/\omega_0 \rangle$ in order of decreasing
amplitude, refraining from assigning components whose harmonic
numbers conflict with those of previously assigned components.
The resulting parameter set thus includes as many
high-amplitude components as possible in the quasi-harmonic
parameter set. The harmonic assignment algorithm is
illustrated in flowchart form by FIG. 10. Harmonic assignment
block 111 then transmits the quasi-harmonic model
amplitudes $\{A_j^k\}$, differential frequencies $\{\Delta_j^k\}$ and phases
$\{\phi_j^k\}$ via paths 126, 127 and 128 respectively, to parameter
encoder 112 for subsequent transmission to storage
element 113.

While the time-varying gain sequence $\sigma[n]$ acts to
increase model accuracy during transition regions of speech
signals and improves the performance of analysis in these
regions, it is not absolutely required for the model to
function, and the additional computation required to estimate
$\sigma[n]$ may outweigh the performance improvements for
certain applications. Therefore, a second version of a speech
analyzer which operates without said time-varying gain
(equivalent to assuming that $\sigma[n] = 1$) is illustrated in FIG. 11.

Speech analyzer 1100 operates identically to speech analyzer
100 with the following exceptions: The signal path
dedicated to calculating, transmitting and framing $\sigma[n]$ is
eliminated, along with the functional blocks associated
therewith. A second difference is seen by considering the
formulas giving DFT's XG[m] and GG[m] in Equation 44
for the case when $\sigma[n] = 1$:

$$XG[m] = \sum_{n=-N_a}^{N_a} w_a[n]\, s[n + kN_s]\, W_M^{mn}, \qquad
GG[m] = \sum_{n=-N_a}^{N_a} w_a[n]\, W_M^{mn}. \tag{53}$$

That is, XG[m] is now the DFT of the speech data frame
multiplied by the analysis window, and GG[m] is simply the
DFT of the analysis window function, which may be calculated
once and used as a fixed function thereafter.

Analysis window block 1103 responds to a frame of
speech data received via path 1121 to multiply said data
frame by the analysis window function $w_a[n]$ to produce the
sequence x[n]g[n]. Fast Fourier Transform block 1105
responds to x[n]g[n] to produce the M-point DFT XG[m]
defined above. Read-only memory block 1104 serves to
store the precalculated DFT GG[m] defined above and to
provide this DFT to analysis-by-synthesis block 1106 as
needed. All other algorithmic components of speech analyzer
1100 and their structural relationships are identical to
those of speech analyzer 100.

FIG. 12 illustrates an analyzer embodiment of the present
invention appropriate for the analysis of pitched musical
tone signals. Musical tone analyzer 1200 of FIG. 12
responds to analog musical tone signals in order to determine
sinusoidal model parameters in a fashion similar to
speech analyzer 100. Musical tone analyzer 1200 digitizes
and quantizes analog musical signals received via path 1220
using A/D converter 1201 in the same manner as A/D
converter 101.

Time-varying gain calculator 1202 responds to the data
stream produced by A/D converter 1201 to produce an
envelope sequence $\sigma[n]$ as described in speech analyzer 100.
The same filtering operation of Equation 2 is used; however,
the filter parameters $\lambda$ and $n_\sigma$ are varied as a function of the
nominal expected pitch frequency of the tone, $\omega'_0$, received
via path 1221 according to the relation



$$\lambda = \xi - \sqrt{\xi^2 - 1}, \tag{54}$$

where $\xi = 2 - \cos\omega'_0$, and $n_\sigma$ is calculated using Equation 6.
The purpose of this variation is to adjust the filter's selectivity
to the expected pitch in order to optimize performance.
Time-varying gain calculator 1202 transmits $\sigma[n]$ via path
1222 to parameter encoder 1210 for subsequent transmission
to storage element 1211.

Overlapping frames of musical signal data and the accompanying
frames of envelope data required for analysis are
isolated from s[n] and $\sigma[n]$ respectively using frame segmenter
blocks 1203 in the same manner as in speech
analyzer 100. Multiplier block 1204 responds to a musical
signal data frame received via path 1223 and an envelope
data frame received via path 1224 to produce the product of
the data frames. Analysis window block 1206 multiplies the
output of multiplier block 1204 by the analysis window
function described in speech analyzer 100, producing the
product of the sequences x[n] and g[n] defined by Equation
16. Squarer block 1205 responds to a frame of envelope data
to produce the square of the envelope data frame; the
resulting output is input to a second analysis window block
to produce the sequence $g^2[n]$. At this point x[n]g[n] and
$g^2[n]$ are input to parallel Fast Fourier Transform blocks
1207, which yield the M-point DFT's XG[m] and GG[m]
defined in Equation 44, respectively.

Harmonically-constrained analysis-by-synthesis block
1208 responds to the input DFT's XG[m] and GG[m] and to
$\omega'_0$ to produce sinusoidal model parameters which approximate
the musical signal data frame. These parameters produce
$s^k[n]$ using the quasi-harmonic representation shown in
Equation 45. The analysis algorithm used is identical to the
fast analysis-by-synthesis algorithm discussed in the
description of speech analyzer 100, with the following
exception: Since an unambiguous initial fundamental frequency
estimate is available, as each candidate frequency
$\omega_c[i]$ is tested to determine the l-th component of x[n], its
harmonic number is calculated as $\langle \omega_c[i]/\omega_0^k \rangle$. If this
equals the harmonic number of any of the previous l-1 components,
the candidate is disqualified, ensuring that only one
component is associated with each harmonic number. As
each new component is determined, the estimate of $\omega_0^k$ is
updated according to Equation 47. This algorithm is illustrated
in flowchart form by FIGS. 13 through 15.

Harmonically-constrained analysis-by-synthesis block
1208 then transmits the fundamental frequency estimate $\omega_0^k$
and the quasi-harmonic model amplitudes $\{A_j^k\}$, differential
frequencies $\{\Delta_j^k\}$ and phases $\{\phi_j^k\}$ via paths 1225, 1226,
1227 and 1228 respectively, to parameter encoder 1210 for
subsequent transmission to storage element 1211. System
estimator 1209 responds to a musical signal data frame
transmitted via path 1223 to produce coefficients representative
of $H(e^{j\omega})$, an estimate of the spectral envelope of the
quasi-harmonic sinusoidal model parameters. The algorithms
which may be used to determine these coefficients are
the same as those used in system estimator 110. System
estimator 1209 then transmits said coefficients via path 1229
to parameter encoder 1210 for subsequent transmission to
storage element 1211.

As previously mentioned, the time-varying gain sequence
$\sigma[n]$ is not required for the model to function; therefore, a
second version of a musical tone analyzer that operates
without said time-varying gain is illustrated in FIG. 16.
Musical tone analyzer 1600 incorporates the same alterations
as described in the discussion of speech analyzer 1100.
Furthermore, although the spectral envelope $H(e^{j\omega})$ is
required to perform pitch-scale modification of musical
signals, when this type of modification is not performed the
spectral envelope is not required in musical tone analysis. In
this case, signal paths 1229 and 1620 and functional blocks
1209 and 1601 are omitted from analyzers 1200 and 1600,
respectively.

FIG. 17 illustrates a synthesizer embodiment of the
present invention appropriate for the synthesis and modification
of speech signals. Speech synthesizer 1700 of FIG. 17
responds to stored encoded quasi-harmonic sinusoidal
model parameters previously determined by speech analysis
in order to produce a synthetic facsimile of the original
analog signal or alternately synthetic speech advantageously
modified in time- and/or frequency-scale.

Parameter decoder 1702 responds to the stored encoded
parameters transmitted from storage element 1701 via path
1720 to yield the time-varying gain sequence $\sigma[n]$ of Equation
8 (if calculated in analysis), the coefficients associated
with vocal tract frequency response estimate $H(e^{j\omega})$ discussed
in the description of speech analyzer 100, and the
fundamental frequency estimate $\omega_0^k$, quasi-harmonic model
amplitudes $\{A_j^k\}$, differential frequencies $\{\Delta_j^k\}$ and phases
$\{\phi_j^k\}$ used to generate a synthetic contribution according to
Equation 45. Although storage element 1701 is shown to be
distinct from storage element 113 of speech analyzer 100, it
should be understood that speech analyzer 100 and speech
synthesizer 1700 may share the same storage element.

Consider now the operation of speech synthesizer 1700 in
greater detail. Referring to Equations 12 and 45, time- and
frequency-scale modification may be performed on isolated
synthesis frames, using different time and frequency scale
factors in each successive frame if desired. A simple
approach to time-scale modification by a factor $\rho_k$ using the
overlap-add sinusoidal model is to change the length of
synthesis frame k from $N_s$ to $\rho_k N_s$ with corresponding time
scaling of the envelope sequence $\sigma[n]$ and the synthesis
window $w_s[n]$. Frequency-scale modification by a factor $\beta_k$
is accomplished by scaling the component frequencies of
each synthetic contribution $s^k[n]$. In either case, time shifts
are introduced to the modified synthetic contributions to
account for changes in phase coherence due to the modifications.

Unfortunately, this simple approach yields modified
speech with reverberant artifacts as well as a noisy, "rough"
quality. Examination of Equation 45 reveals why. Since the
differential frequencies $\{\Delta_j^k\}$ are nonzero and independent,
they cause the phase of each component sinusoid to evolve
nonuniformly with respect to other components. This "phase
evolution" results in a breakdown of coherence in the model
as the time index deviates beyond analysis frame boundaries,
as illustrated in FIGS. 18A and 18B. Time-shifting this
extrapolated sequence therefore introduces incoherence to
the modified speech.

The present invention overcomes the problem of uncontrolled
phase evolution by altering the component frequencies
of $s^k[n]$ in the presence of modifications according to the
relation

$$\omega_j^k \rightarrow j\beta_k\omega_0^k + \Delta_j^k/\rho_k.$$

This implies that as the time scale factor $\rho_k$ is increased, the
component frequencies "pull in" towards the harmonic
frequencies, and in the limit the synthetic contributions
become purely periodic sequences. The effect is to slow
phase evolution, so that coherence breaks down proportionally
farther from the analysis frame center to account for the
longer synthesis frame length. The behavior of a synthetic
contribution modified in this way is illustrated in FIGS. 19A
and 19B.

Based on this new approach, a synthesis equation similar
to Equation 12 may be constructed:

(55)

for $0 \le n < \rho_k N_s$, where $N_k = N_s \sum_{i=0}^{k-1} \rho_i$ is the
starting point of the modified synthesis frame, and where

(56)

Techniques for determining the time shifts $\delta^k$ and $\delta^{k+1}$ will
be discussed later. It should be noted that when $\beta_k > 1$, it is
possible for the component frequencies of $s^k_{\rho_k,\beta_k}[n]$ to exceed
$\pi$, resulting in "aliasing." For this reason it is necessary to set
the amplitude of any component whose modified frequency
is greater than $\pi$ to zero.
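The frequency alteration and the anti-aliasing rule can be illustrated together. This is a sketch under a stated assumption: the modified frequency of harmonic j is taken to be $j\beta\omega_0 + \Delta_j/\rho$, which matches the "pull in" behaviour described above but is not copied from a legible equation in this text. The function name `modified_components` is hypothetical.

```python
import math


def modified_components(amps, deltas, w0, rho, beta):
    """Return (amplitudes, frequencies) after time/frequency-scale changes.

    Assumed rule: harmonic j moves to beta * j * w0 + deltas[j] / rho.
    Components whose modified frequency exceeds pi are zeroed to
    prevent aliasing.
    """
    out_amps = []
    out_freqs = []
    for j, (a, d) in enumerate(zip(amps, deltas)):
        w = beta * j * w0 + d / rho
        out_freqs.append(w)
        out_amps.append(a if w <= math.pi else 0.0)
    return out_amps, out_freqs
```

With a large time-scale factor the differential terms vanish and the frequencies approach exact multiples of the scaled fundamental, which is the purely periodic limit mentioned in the text.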

Pitch onset time estimator 1703 responds to the coefficients
representing $H(e^{j\omega})$ received via path 1721, the fundamental
frequency estimate received via path 1722, and the
quasi-harmonic model amplitudes, differential frequencies
and phases received via paths 1723, 1724 and 1725 respectively
in order to estimate the time relative to the center of
an analysis frame at which an excitation pulse occurs. This
function is achieved using an algorithm similar to one
developed by McAulay and Quatieri in "Phase Modelling
and its Application to Sinusoidal Transform Coding," Proc.
IEEE Int'l Conf. on Acoust., Speech and Signal Processing,
pp. 1713-1715, April 1986, and based on the observation
that the glottal excitation sequence (which is ideally a
periodic pulse train) may be expressed using the quasi-harmonic
sinusoidal representation of Equations 8 and 45,
where the synthetic contributions $s^k[n]$ are replaced by

$$e^k[n] = \sum_{l=0}^{J[k]} b_l^k \cos(\omega_l^k n + \theta_l^k), \tag{57}$$

and where the amplitude and phase parameters of $e^k[n]$ are
given by

$$b_l^k = A_l^k / |H(e^{j\omega_l^k})|, \qquad
\theta_l^k = \phi_l^k - \angle H(e^{j\omega_l^k}). \tag{58}$$

This process is referred to as "deconvolution." Assuming for
simplicity that $\omega_l^k = l\omega_0^k$ and suppressing frame notation,
Equation 57 may be rewritten as

$$e[n] = \sum_{l=0}^{J} b_l \cos(l\omega_0 (n - \tau_p) + \psi_l(\tau_p)), \tag{59}$$

where

$$\psi_l(\tau_p) = \theta_l + l\omega_0\tau_p. \tag{60}$$

One of the properties of the vocal tract frequency
response estimate $H(e^{j\omega})$ is that the amplitude parameters
$A_l^k$ are approximately proportional to the magnitude of
$H(e^{j\omega})$ at the corresponding frequencies $\omega_l^k$; thus, the
deconvolved amplitude parameters $\{b_l^k\}$ are approximately
constant. If, in addition, the "time-shifted" deconvolved phase
parameters $\{\psi_l(\tau_p)\}$ are close to zero or $\pi$ for some value of
$\tau_p$ (termed "maximal coherence"), then $e^k[n]$ is approximately
a periodic pulse train with a "pitch onset time" of $\tau_p$.
By assuming the condition of maximal coherence, an
approximation to $s^k[n]$ may be constructed by reversing the
deconvolution process of Equation 58, yielding

$$\hat{s}^k_{\tau_p}[n] = \sum_{l=0}^{J[k]} A_l^k \cos(l\omega_0^k (n - \tau_p)
+ \angle H(e^{jl\omega_0^k}) + m\pi), \tag{61}$$

where m is either zero or one.

The pitch onset time parameter $\tau_p$ may then be defined as
that value of $\tau$ which yields the minimum mean-square error
between $s^k[n]$ and $\hat{s}^k_\tau[n]$ over the original analysis frame,

$$E(\tau) = \sum_{n=-N_a}^{N_a} \left\{ s^k[n] - \sum_{l=0}^{J[k]} A_l^k
\cos(l\omega_0^k (n - \tau) + \angle H(e^{jl\omega_0^k}) + m\pi) \right\}^2. \tag{62}$$

Assuming that $N_a$ is a pitch period or more, this is approximately
equivalent to finding the absolute maximum of the
pitch onset likelihood function

$$L(\tau) = \sum_{l=0}^{J[k]} (A_l^k)^2 \cos(\psi_l(\tau)) \tag{63}$$

in terms of $\tau$. Unfortunately, this problem does not have a
closed-form solution; however, due to the form of $\psi_l(\tau)$,
$L(\tau)$ is periodic with period $2\pi/\omega_0$. Therefore, the pitch onset
time may be estimated by evaluating $L(\tau)$ at a number
(typically greater than 128) of uniformly spaced points on
the interval $[-\pi/\omega_0, \pi/\omega_0]$ and choosing $\tau_p$ to correspond to
the maximum of $|L(\tau)|$. This algorithm is shown in flowchart
form in FIGS. 20 and 21.
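The grid search described above can be sketched as follows. The likelihood is assumed to have the form $L(\tau) = \sum_l A_l^2 \cos(\theta_l + l\omega_0\tau)$, where $\theta_l$ are the deconvolved phases; the function name `pitch_onset_time` and the default of 256 grid points are illustrative choices only.

```python
import math


def pitch_onset_time(amps, thetas, w0, n_points=256):
    """Return the grid point in [-pi/w0, pi/w0) that maximizes |L(tau)|.

    amps[l] and thetas[l] are the amplitude and deconvolved phase of
    harmonic l; psi_l(tau) = thetas[l] + l * w0 * tau is assumed.
    """
    half = math.pi / w0
    best_tau = -half
    best_val = -1.0
    for p in range(n_points):
        tau = -half + 2.0 * half * p / n_points
        val = abs(sum(a * a * math.cos(th + l * w0 * tau)
                      for l, (a, th) in enumerate(zip(amps, thetas))))
        if val > best_val:
            best_tau, best_val = tau, val
    return best_tau
```

Taking the absolute value lets a single search cover both polarities (m = 0 and m = 1); the sign of L at the maximum then indicates which polarity applies.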

DFT assignment block 1704 responds to the fundamental
frequency $\omega_0^k$ received via path 1722, the sets of quasi-harmonic
model amplitudes, differential frequencies and
phases received via paths 1723, 1724 and 1725 respectively,
pitch onset time estimate $\tau_p^k$ received via path 1726,
frequency-scale modification factor $\beta_k$ and time-scale
modification factor $\rho_k$ received via paths 1727 and 1728,
respectively, to produce a sequence Z[i] which may be used to
construct a modified synthetic contribution using an FFT
algorithm.
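The idea of synthesizing a sum of bin-centered sinusoids as the real part of a DFT can be illustrated with a small sketch. It assumes the forward-transform convention $W_M = e^{-j2\pi/M}$, under which the entry placed at bin i is $A e^{-j\phi}$; the function name `synth_via_dft` is hypothetical, and a direct DFT stands in for the FFT.

```python
import cmath
import math


def synth_via_dft(amps, phases, bins, m_len):
    """Synthesize sum_l A_l cos(2 pi i_l n / M + phi_l) for n = 0..M-1.

    Z[i] holds A * exp(-j * phi) at each component's bin and zero
    elsewhere; the real part of the forward DFT of Z is the waveform.
    """
    z = [0j] * m_len
    for a, ph, i in zip(amps, phases, bins):
        z[i] += a * cmath.exp(-1j * ph)
    return [
        sum(z[i] * cmath.exp(-2j * math.pi * i * n / m_len)
            for i in range(m_len)).real
        for n in range(m_len)
    ]
```

Replacing the direct sum with an FFT gives all M output samples in O(M log M) operations regardless of how many sinusoids are present, which is the source of the method's efficiency.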

Consider the operation of DFT assignment block 1704 in
greater detail. Referring to Equation 10, since the component
frequencies of $s^k[n]$ are given by $\omega_l^k = 2\pi i_l/M$, a
synthetic contribution may be expressed as

$$s^k[n] = \sum_{l} A_l^k \cos(2\pi i_l n/M + \phi_l^k). \tag{64}$$

Recognizing that $A_l^k \cos(2\pi i_l n/M + \phi_l^k) =
\Re e\{A_l^k e^{-j(2\pi i_l n/M + \phi_l^k)}\}$, this becomes

$$s^k[n] = \Re e\left\{ \sum_{l} A_l^k e^{-j\phi_l^k}\, W_M^{i_l n} \right\}. \tag{65}$$

Thus, by Equation 31, any sequence expressed as a sum of
constant-amplitude, constant-frequency sinusoids whose
frequencies are constrained to be multiples of $2\pi/M$ is
alternately given as the real part of the M-point DFT of a
sequence Z[i] with values of $A_l^k e^{-j\phi_l^k}$ at $i = i_l$ and zero
otherwise. This DFT may be calculated using an FFT
algorithm.

According to Equation 56, in the presence of time- and
frequency-scale modification a synthetic contribution is
given by

$$s^k_{\rho_k,\beta_k}[n] = \sum_{l=0}^{J[k]} A_l^k
\cos(\hat{\omega}_l^k n + \hat{\phi}_l^k), \tag{66}$$

where

(67)

(68)

Except for the case when $\rho_k = \beta_k = 1$, the modified frequency
terms no longer fall at multiples of $2\pi/M$; however, an FFT
algorithm may still be used to accurately represent
$s^k_{\rho_k,\beta_k}[n]$. Ignoring frame notation, this is accomplished by
calculating the DFT indices whose corresponding frequencies
are adjacent to $\hat{\omega}_l$:

$$i_{1,l} = \lfloor \hat{M}\hat{\omega}_l / 2\pi \rfloor, \tag{69}$$

$$i_{2,l} = i_{1,l} + 1, \tag{70}$$

where $\lfloor \cdot \rfloor$ denotes the "greatest integer less than or equal to"
operator.

The length of the DFT used in modification synthesis, $\hat{M}$,
is adjusted to compensate for the longer frame lengths
required in time-scale modification and is typically greater
than or equal to $\rho_k M$. Each component of $s^k_{\rho_k,\beta_k}[n]$ is then
approximated using two components with frequencies
$\omega_{1,l} = 2\pi i_{1,l}/\hat{M}$ and $\omega_{2,l} = 2\pi i_{2,l}/\hat{M}$ in the
following manner: Given a single sinusoidal component with an
unconstrained frequency $\hat{\omega}_l$ of the form

$$c_l[n] = A_l \cos(\hat{\omega}_l n + \hat{\phi}_l)
= a_l \cos\hat{\omega}_l n + b_l \sin\hat{\omega}_l n, \tag{71}$$

two sinusoids with constrained frequencies are added
together to form an approximation to $c_l[n]$:

$$\hat{c}_l[n] = A_{1,l} \cos(\omega_{1,l} n + \phi_{1,l})
+ A_{2,l} \cos(\omega_{2,l} n + \phi_{2,l})
= a_{1,l} \cos\omega_{1,l} n + b_{1,l} \sin\omega_{1,l} n
+ a_{2,l} \cos\omega_{2,l} n + b_{2,l} \sin\omega_{2,l} n. \tag{72}$$

Letting $\hat{N}_s = \rho_k N_s$ and using the squared error norm

$$E_l = \sum_{n=-\hat{N}_s}^{\hat{N}_s} \{ c_l[n] - \hat{c}_l[n] \}^2, \tag{73}$$

minimization of $E_l$ in terms of the coefficients of $\hat{c}_l[n]$ leads
to the conditions

$$\frac{\partial E_l}{\partial a_{1,l}} = \frac{\partial E_l}{\partial a_{2,l}}
= \frac{\partial E_l}{\partial b_{1,l}} = \frac{\partial E_l}{\partial b_{2,l}} = 0. \tag{74}$$

Expanding the first condition using Equation 72 yields

$$\sum_{n=-\hat{N}_s}^{\hat{N}_s} \hat{c}_l[n] \cos\omega_{1,l} n
= \sum_{n=-\hat{N}_s}^{\hat{N}_s} c_l[n] \cos\omega_{1,l} n. \tag{75}$$

Equations 71 and 72 may be substituted into this equation;
however, noting that

$$\sum_{n=-N}^{N} \cos\alpha n \sin\beta n = 0$$

for all $\alpha$, $\beta$ and N, the resulting expression simplifies to

$$a_{1,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \cos^2\omega_{1,l} n
+ a_{2,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \cos\omega_{1,l} n \cos\omega_{2,l} n
= a_l \sum_{n=-\hat{N}_s}^{\hat{N}_s} \cos\hat{\omega}_l n \cos\omega_{1,l} n. \tag{76}$$

Similarly, the other conditions of Equation 74 are given by
the equations

$$a_{1,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \cos\omega_{1,l} n \cos\omega_{2,l} n
+ a_{2,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \cos^2\omega_{2,l} n
= a_l \sum_{n=-\hat{N}_s}^{\hat{N}_s} \cos\hat{\omega}_l n \cos\omega_{2,l} n, \tag{77}$$

$$b_{1,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \sin^2\omega_{1,l} n
+ b_{2,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \sin\omega_{1,l} n \sin\omega_{2,l} n
= b_l \sum_{n=-\hat{N}_s}^{\hat{N}_s} \sin\hat{\omega}_l n \sin\omega_{1,l} n, \tag{78}$$

and

$$b_{1,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \sin\omega_{1,l} n \sin\omega_{2,l} n
+ b_{2,l} \sum_{n=-\hat{N}_s}^{\hat{N}_s} \sin^2\omega_{2,l} n
= b_l \sum_{n=-\hat{N}_s}^{\hat{N}_s} \sin\hat{\omega}_l n \sin\omega_{2,l} n. \tag{79}$$

Equations 76 and 77 form a pair of normal equations in the
form of Equation 23 which may be solved using the formulas
of Equation 25 for $a_{1,l}$ and $a_{2,l}$; likewise, Equations 78
and 79 are a second, independent pair of normal equations
yielding $b_{1,l}$ and $b_{2,l}$.

The inner product terms in Equations 76-79 may be
calculated using the relations

$$\sum_{n=-N}^{N} \cos\alpha n \cos\beta n
= \tfrac{1}{2} F_N(\alpha - \beta) + \tfrac{1}{2} F_N(\alpha + \beta), \qquad
\sum_{n=-N}^{N} \sin\alpha n \sin\beta n
= F_N(\alpha - \beta) - \sum_{n=-N}^{N} \cos\alpha n \cos\beta n, \tag{80}$$

where the function $F_N(\omega)$, defined as

$$F_N(\omega) = \sum_{n=-N}^{N} \cos\omega n
= \frac{\sin((2N+1)\omega/2)}{\sin(\omega/2)}, \tag{81}$$

may be precalculated and used as required. Given the
parameters determined from the two sets of normal equations,
the amplitude and phase parameters of $\hat{c}_l[n]$ are
derived using the relationships of Equation 27. The resulting
amplitude and phase parameters can then be assigned to the
$\hat{M}$-point sequence Z[i] as described previously at index
values $i_{1,l}$ and $i_{2,l}$.

In speech signals, synthetic contributions are highly correlated
from one frame to the next. In the presence of
modifications, this correlation must be maintained if the
resulting modified speech is to be free from artifacts. To
accomplish this, the time shifts $\delta^k$ and $\delta^{k+1}$ in Equation 56
may be determined such that the underlying excitation signal
obeys specific constraints in both the unmodified and modified
cases. Examining Equation 59, if the component amplitudes
are set to unity and the phases set to zero, a "virtual
excitation" sequence, or an impulse train with fundamental
frequency $\omega_0^k$ and shifted relative to the synthesis frame
boundary by $\tau_p^k$ samples, results. In "Phase Coherence in
Speech Reconstruction for Enhancement and Coding Applications,"
Proc. IEEE Int'l Conf. on Acoust., Speech and
Signal Processing, pp. 207-210, May 1989, McAulay and
Quatieri derive an algorithm to preserve phase coherence in
the presence of modifications using virtual excitation analysis.
The following is a description of a refined version of this
algorithm.

As illustrated in FIGS. 22A and 22B, in synthesis frame
k the unmodified virtual excitation of the k-th synthetic
contribution has pulse locations relative to frame boundary
A of $\tau_p^k + iT_0^k$, where $T_0^k = 2\pi/\omega_0^k$. These impulses are
denoted by O's. Likewise, the pulse locations of the virtual
excitation of the (k+1)-st synthetic contribution relative to
frame boundary B are $\tau_p^{k+1} + iT_0^{k+1}$; these pulses are denoted
by X's. For some integer $i_k$ a pulse location of the k-th
contribution is adjacent to frame center C; likewise, for
some $i_{k+1}$, a pulse location of the (k+1)-st contribution is
adjacent to frame center C. The values of $i_k$ and $i_{k+1}$ can be
found as

(82)



The time difference between the pulses adjacent to frame
center C is shown as $\Delta$.
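
As a worked illustration of the pulse bookkeeping, the sketch below locates the pulse of contribution k at or before frame center C and the pulse of contribution k+1 at or after C, then returns their spacing. The choice of floor rounding, the frame length symbol `n_s` and all numeric values are assumptions made for illustration rather than values taken from the text.

```python
import math

def adjacent_pulses(n_s, tau_k, t0_k, tau_k1, t0_k1):
    """Return (i_k, i_k1, spacing) for the two pulses that bracket frame center C.

    Boundary A is at sample 0, boundary B at n_s, and center C at n_s / 2.
    """
    # Last pulse of contribution k that does not pass the frame center.
    i_k = math.floor((n_s / 2 - tau_k) / t0_k)
    # First pulse of contribution k+1 that is not before the frame center.
    i_k1 = -math.floor((n_s / 2 + tau_k1) / t0_k1)
    pulse_k = tau_k + i_k * t0_k                # measured from boundary A
    pulse_k1 = n_s + tau_k1 + i_k1 * t0_k1      # boundary B is n_s samples after A
    return i_k, i_k1, pulse_k1 - pulse_k

# Hypothetical frame: 100 samples, pitch periods of 20 and 22 samples.
print(adjacent_pulses(n_s=100, tau_k=3, t0_k=20, tau_k1=5, t0_k1=22))
```

In this example the two pulses fall at samples 43 and 61, on either side of the center at sample 50.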

In the presence of time- and frequency-scale modification,
the relative virtual excitation pulse locations are changed to
$n = (\tau_p^k + iT_0^k)/\beta_k - \delta^k$ and
$n = (\tau_p^{k+1} + iT_0^{k+1})/\beta_{k+1} - \delta^{k+1}$ for
modified synthetic contributions k and k+1, respectively. In
order to preserve frame-to-frame phase coherence in the
presence of modifications, the time shift $\delta^{k+1}$ must be
adjusted such that the time difference between pulses adjacent
to modified frame center C is equal to $\Delta/\beta_{av}$, where
$\beta_{av} = (\beta_k + \beta_{k+1})/2$. This condition is also
shown in FIGS. 22A and 22B. The coherence requirement leads to
an equation which can be solved for $\delta^{k+1}$, yielding the
recursive relation



(83) 
where



$$\cdots\; T_0^k + (i_k T_0^k - i_{k+1} T_0^{k+1})/\beta_{av} \qquad (84)$$
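
The coherence requirement can be checked numerically. The sketch below solves the stated condition directly for the next time shift, using the modified pulse-location expressions given in the text. It is derived from that condition and is not a transcription of Equations 83 and 84. It assumes that the modified frame length is the time-scale factor `rho_k` times the unmodified length `n_s`, and that the same pulse indices are used before and after modification; all names are hypothetical.

```python
def next_time_shift(delta_k, n_s, rho_k, beta_k, beta_k1,
                    tau_k, t0_k, i_k, tau_k1, t0_k1, i_k1):
    """Solve the pulse-spacing condition for the time shift of contribution k+1."""
    beta_av = (beta_k + beta_k1) / 2
    pulse_k = tau_k + i_k * t0_k           # unmodified pulse, relative to boundary A
    pulse_k1 = tau_k1 + i_k1 * t0_k1       # unmodified pulse, relative to boundary B
    spacing = n_s + pulse_k1 - pulse_k     # unmodified spacing across the frame center
    target = spacing / beta_av             # required spacing after modification
    # Modified locations: pulse_k / beta_k - delta_k (from A) and
    # rho_k * n_s + pulse_k1 / beta_k1 - delta_k1 (from A, assumed frame length).
    return rho_k * n_s + pulse_k1 / beta_k1 - (pulse_k / beta_k - delta_k) - target

def modified_spacing(delta_k, delta_k1, n_s, rho_k, beta_k, beta_k1, pulse_k, pulse_k1):
    """Spacing between the two modified pulses, for checking the condition."""
    return (rho_k * n_s + pulse_k1 / beta_k1 - delta_k1) - (pulse_k / beta_k - delta_k)
```

With no modification (both scale factors equal to one and a zero starting shift) the returned shift is zero, as expected.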



The algorithms involved in DFT assignment block 1704 are 
illustrated in flowchart form in FIGS. 23 and 24. 

FFT block 1705 responds to the complex sequence Z[i]
produced by DFT assignment block 1704 to produce a complex
sequence z[n] which is the DFT of Z[i] according to
Equation 31. Overlap-add block 1706 responds to the complex
sequence output by FFT block 1705, time-scale modification
factor $\rho_k$ received via path 1728, and time-varying gain
sequence $\sigma[n]$ received via path 1729 to produce a
contiguous sequence s[n], representative of synthetic speech,
on a frame-by-frame basis. This is accomplished in the
following manner: Taking the real part of the input sequence
z[n] yields the modified synthetic contribution sequence, as in
the discussion of DFT assignment block 1704. Using the relation
expressed in Equation 55, a synthesis frame of s[n] is generated
by taking two successive modified synthetic contributions,
multiplying them by shifted and time-scaled versions of the
synthesis window $w_s[n]$, adding the two windowed sequences
together, and multiplying the resulting sequence by the
time-scaled time-varying gain sequence $\sigma[n]$.
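
The overlap-add step just described can be sketched as follows. The sketch assumes a triangular synthesis window whose shifted copies sum to one across the frame; the window actually used is defined elsewhere in the specification and may differ. The function and parameter names are hypothetical, and time scaling is represented simply by the caller choosing the frame length.

```python
def overlap_add_frame(contrib_k, contrib_k1, frame_len, gain=None):
    """Build one synthesis frame from two successive synthetic contributions.

    contrib_k  : function of n giving contribution k, n measured from the frame start
    contrib_k1 : function of n giving contribution k+1, n measured from the frame end
    gain       : optional per-sample gain sequence; unity gain when omitted
    """
    frame = []
    for n in range(frame_len):
        w_k = 1.0 - n / frame_len      # assumed window, centered on the frame start
        w_k1 = n / frame_len           # same window shifted to the frame end
        g = 1.0 if gain is None else gain[n]
        frame.append(g * (w_k * contrib_k(n) + w_k1 * contrib_k1(n - frame_len)))
    return frame
```

When both contributions describe the same stationary sinusoid, the two windowed copies recombine into that sinusoid, which is the behavior the phase coherence constraints are intended to preserve under modification.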

It should be understood that if speech analysis was
performed without the time-varying gain sequence, then
data path 1729 may be omitted from synthesizer 1700, and
the overlap-add algorithm implemented with $\sigma[n] = 1$. In
addition, it should be readily apparent to those skilled in the
art that if only time-scale modification is desired, data path
1727 may be omitted, and the modification algorithms
described may be implemented with $\beta_k = 1$ for all k.
Likewise, if only frequency-scale modification is desired, then
data path 1728 may be omitted, and the modification
algorithms described may be implemented with $\rho_k = 1$ for
all k.
Given s[n], overlap-add block 1706 then produces an
output data stream by quantizing the synthetic speech
sequence using a quantization operator as in Equation 1.
Digital-to-analog (D/A) converter 1707 responds to the data
stream produced by overlap-add block 1706 to produce an
analog signal $s_c(t)$ which is output from speech synthesizer
1700 via path 1730.

While time- and frequency-scale modification of analyzed
speech is sufficient for many applications, for certain
applications other information must be accounted for when
performing modifications. For instance, when speech is
frequency-scale modified using speech synthesizer 1700, the
component frequencies used in the sinusoidal model are
changed, but the amplitude parameters are unaltered except
as required to prevent aliasing; this results in compression or
expansion of the "spectral envelope" of analyzed speech (of
which $|H(e^{j\omega})|$ is an estimate). Since identifiable speech
sounds are critically determined by this envelope, such
"spectral distortion" may seriously degrade the intelligibility
of synthetic speech produced by synthesizer 1700. Therefore,
it is important to consider an approach to altering the
fundamental frequency of speech while preserving its spectral
envelope; this is known as pitch-scale modification.
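
The difference between the two kinds of modification can be shown with a toy envelope. In the sketch below, frequency-scale modification moves each harmonic while keeping its original amplitude, so the envelope peak moves with it; pitch-scale modification re-reads the amplitude from the envelope at the new frequency, so the peak stays put. The envelope shape, the scale factor and all names are hypothetical and chosen only for illustration.

```python
def envelope(omega):
    """Hypothetical spectral envelope with a single resonance near 1.0 rad/sample."""
    return 1.0 / (1.0 + 25.0 * (omega - 1.0) ** 2)

def frequency_scale(omega0, beta, count):
    """Harmonics move to beta * l * omega0; amplitudes stay with their harmonic."""
    return [(beta * l * omega0, envelope(l * omega0)) for l in range(1, count + 1)]

def pitch_scale(omega0, beta, count):
    """Harmonics move to beta * l * omega0; amplitudes follow the fixed envelope."""
    return [(beta * l * omega0, envelope(beta * l * omega0)) for l in range(1, count + 1)]

def peak_frequency(components):
    """Frequency of the strongest component in a list of (frequency, amplitude) pairs."""
    return max(components, key=lambda c: c[1])[0]

print(peak_frequency(frequency_scale(0.1, 1.5, 20)))  # resonance shifted upward
print(peak_frequency(pitch_scale(0.1, 1.5, 20)))      # resonance stays near 1.0
```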

A second version of a speech synthesizer capable of
performing time- and pitch-scale modification on previously
analyzed speech signals is illustrated in FIG. 25. Speech
synthesizer 2500 operates identically to speech synthesizer
1700, except that an additional step, phasor interpolator
2501, is added to counteract the effects of spectral distortion
encountered in speech synthesizer 1700.

Phasor interpolator 2501 responds to the same set of
parameters input to pitch onset time estimator 1703, the
pitch onset time $\tau_p^k$ determined by pitch onset time
estimator 2502 received via path 2520, and the pitch-scale
modification factor $\beta_k$ received via path 2521 in order to
determine a modified set of amplitudes $\{A_j^k\}$, harmonic
differential frequencies $\{\Delta_j^k\}$, and phases
$\{\phi_j^k\}$ which produce a pitch-scale modified version of
the original speech data frame.

Consider now the operation of phasor interpolator 2501 in
greater detail: According to the discussion of pitch onset
time estimator 1703, a synthetic contribution to the glottal
excitation sequence as given in Equation 57 is approximately
a periodic pulse train whose fundamental frequency
is $\omega_0^k$. In a manner similar to the pitch-excited LPC
model, it might be expected that scaling the frequencies of
$e^k[n]$ by $\beta_k$ and "reconvolving" with $H(e^{j\omega})$ at
the scaled frequencies $\{\beta_k\omega_l^k\}$ would result in
synthetic speech with a fundamental frequency of
$\beta_k\omega_0^k$ that maintains the same spectral shape of
$H(e^{j\omega})$, and therefore the same intelligibility, as the
original speech. Unfortunately, since the frequencies of
$e^k[n]$ span the range from zero to $\pi$, this approach results
in component frequencies spanning the range from zero to
$\beta_k\pi$. For pitch-scale factors less than one, this
"information loss" imparts a muffled quality to the modified
speech.

To address this problem, consider the periodic sequence
obtained from $e^k[n]$ by setting $\omega_l^k = l\omega_0^k$:

$$e_c^k[n] = \sum_{l=0} b_l^k \cos(l\omega_0^k n + \theta_l^k).$$

The goal of modifying the fundamental frequency of
$e_c^k[n]$ without information loss is to specify a set of
modified amplitude and phase parameters for the modified
residual $e_\beta[n]$, given by

(where $J_\beta^k = L^k/\beta_k$) which span the frequency range
from zero to $\pi$. Since as a function of frequency the pairs of
amplitude and phase parameters are evenly spaced, a
reasonable approach to this problem is to interpolate the
complex "phasor form" of the unmodified amplitude and
phase parameters across the spectrum and to derive modified
parameters by resampling this interpolated function at the
modified frequencies.

Again suppressing frame notation, this implies that given
the interpolated function $\mathcal{E}(\omega)$, where

$$\mathcal{E}(\omega) = \sum_{l=0} b_l e^{j\theta_l} I(\omega - l\omega_0), \qquad (87)$$

the modified amplitudes are given by
$\hat{b}_l = |\mathcal{E}(\beta l\omega_0)|$, and the modified phases
by $\hat{\theta}_l = \angle\mathcal{E}(\beta l\omega_0)$.

While any interpolation function $I(\omega)$ with the properties
$I(l\omega_0) = 0$ for $l \neq 0$ and $I(0) = 1$ may be employed, a
raised-cosine interpolator of the form

$$I(\omega) = \begin{cases} \cos^2(\pi\omega/2\omega_0), & |\omega| \le \omega_0 \\ 0, & \text{otherwise} \end{cases}$$

makes the computation of $\mathcal{E}(\omega)$ much simpler, since
all but two terms drop out of Equation 87 at any given frequency.
Furthermore, since $I(\omega)$ is bandlimited, the effect of any
single noise-corrupted component of $e^k[n]$ on the modified
parameters is strictly limited to the immediate neighborhood
of that component's frequency. This greatly reduces the
problem of inadvertently amplifying the background noise
during modification by assuring that noise effects concentrated
in one part of the spectrum do not "migrate" to another
part of the spectrum where the magnitude of $H(e^{j\omega})$ may
be greatly different.

The discussion of phasor interpolation to this point has
ignored one important factor: the interpolated function
$\mathcal{E}(\omega)$ is seriously affected by the phase terms
$\{\theta_l\}$. To see this, consider the case when $\theta_l = 0$
for all l; in this case, $\mathcal{E}(\omega)$ is simply a
straightforward interpolation of the amplitude parameters.
However, if every other phase term is $\pi$ instead,
$\mathcal{E}(\omega)$ interpolates adjacent amplitude parameters
with opposite signs, resulting in a very different set of
modified amplitude parameters. It is therefore reasonable to
formulate phasor interpolation such that the effect of phase on
the modified amplitudes is minimized.

As mentioned above, when the phase terms are all close
to zero, phasor interpolation approximates amplitude
interpolation. Furthermore, examining Equation 87 reveals that
when the phase terms are all close to $\pi$, phasor interpolation
is approximately interpolation of amplitudes with a sign
change, and that deviation from either of these conditions
results in undesirable nonlinear amplitude interpolation.
Recalling the description of pitch onset time estimator 1703,
$\tau_p$ is estimated such that the "time-shifted" phase
parameters $\{\psi_l(\tau_p)\}$ have exactly this property.
Therefore, the phasor interpolation procedure outlined above may
be performed using $\{\psi_l(\tau_p)\}$ instead of $\{\theta_l\}$,
yielding the modified amplitude parameters $\{\hat{b}_l\}$ and
interpolated phases $\{\hat{\psi}_l(\tau_p)\}$. The modified phase
terms may then be derived by reversing the time shift imparted
to $\{\psi_l(\tau_p)\}$:

(89)
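
The raised-cosine phasor interpolation can be sketched in a few lines. The sketch forms the complex phasor of each harmonic, interpolates with a raised-cosine kernel that is zero beyond one harmonic spacing, and resamples at the scaled harmonic frequencies. It assumes the phases supplied are the ones to be interpolated (the time shift by the pitch onset time is not modeled), and it stops resampling at the highest original harmonic; all names are hypothetical.

```python
import cmath
import math

def raised_cosine(omega, omega0):
    """Interpolation kernel: one at zero, zero at every other harmonic of omega0."""
    if abs(omega) >= omega0:
        return 0.0
    return math.cos(math.pi * omega / (2 * omega0)) ** 2

def interpolate_phasors(amps, phases, omega0, beta):
    """Resample the interpolated phasor function at beta * m * omega0.

    Returns a list of (amplitude, phase) pairs for m = 0, 1, 2, ...
    """
    count = len(amps)
    modified = []
    m = 0
    while beta * m <= count - 1 + 1e-9:
        omega = beta * m * omega0
        # At most two kernel terms are nonzero at any frequency.
        value = sum(
            amps[l] * cmath.exp(1j * phases[l]) * raised_cosine(omega - l * omega0, omega0)
            for l in range(count)
        )
        modified.append((abs(value), cmath.phase(value)))
        m += 1
    return modified

# Phases near zero give plain amplitude interpolation; alternating phases cancel.
print(interpolate_phasors([1.0, 1.0], [0.0, 0.0], 0.2, 0.5))
print(interpolate_phasors([1.0, 1.0], [0.0, math.pi], 0.2, 0.5))
```

The second call shows why the phases matter: halfway between two harmonics of opposite sign the interpolated amplitude collapses toward zero, which is the nonlinear behavior the time-shifted phases are meant to avoid.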



At this point all that remains is to specify appropriate
differential frequency terms in the equation for $e^k[n]$.
Although this task is somewhat arbitrary, it is reasonable to
expect that the differential frequency terms may be
interpolated uniformly in a manner similar to phasor
interpolation, yielding

(90)

This interpolation has the effect that the modified differential
frequencies follow the same trend in the frequency domain
as the unmodified differentials, which is important both in
preventing migration of noise effects and in modifying
speech which possesses a noise-like structure in certain
portions of the spectrum.

Given the amplitude, phase and differential frequency
parameters of a modified excitation contribution, the
specification of a synthetic contribution to pitch-scale modified
speech may be completed by reintroducing the effects of the
spectral envelope to the amplitude and phase parameters at
the modified frequencies
$\omega_j^k = j\beta_k\omega_0^k + \Delta_j^k$:

(91)

where the multiplicative factor of $\beta_k$ on the amplitude
parameters serves to normalize the amplitude of the modified
speech. The algorithm used in phasor interpolator 2501
is illustrated in flowchart form in FIGS. 26 and 27. All other
algorithmic components of speech synthesizer 2500 and
their structural relationships are identical to those of speech
synthesizer 1700. As in speech synthesizer 1700, data path
2522 (which is used to transmit time-scale modification
factor $\rho_k$) may be omitted if only pitch-scale modification
is desired, and modification may be implemented with
$\rho_k = 1$ for all k.

FIG. 28 illustrates a synthesizer embodiment of the
present invention appropriate for the synthesis and
modification of pitched musical tone signals. Music synthesizer
2800 of FIG. 28 responds to stored encoded quasi-harmonic
sinusoidal model parameters previously determined by
music signal analysis in order to produce a synthetic
facsimile of the original analog signal or alternately synthetic
speech advantageously modified in time- and/or
frequency-scale. Parameter decoder 2802 responds to encoded
parameters retrieved from storage element 2801 via path 2820 in
a manner similar to parameter encoder 1702 to produce the
time-varying gain sequence $\sigma[n]$ of Equation 8 (if
calculated in analysis) and the fundamental frequency estimate
$\omega_0^k$, quasi-harmonic model amplitudes $\{A_j^k\}$,
differential frequencies $\{\Delta_j^k\}$ and phases
$\{\phi_j^k\}$ used to generate a synthetic contribution
according to Equation 45.

DFT assignment block 2803 responds to the fundamental
frequency received via path 2821, the sets of quasi-harmonic
model amplitudes, differential frequencies and phases
received via paths 2822, 2823 and 2824 respectively,
frequency-scale modification factor $\beta_k$ and time-scale
modification factor $\rho_k$ received via paths 2825 and 2826,
respectively, to produce a sequence Z[i] which may be used to
construct a modified synthetic contribution using an FFT
algorithm. The algorithm used in this block is identical to
that of DFT assignment block 1704 of FIG. 17, with the
following exception: The purpose of the excitation pulse
constraint algorithm used to calculate time shifts $\delta^k$ and
$\delta^{k+1}$ in DFT assignment block 1704 is that the algorithm
is relatively insensitive to errors in fundamental frequency
estimation resulting in an estimate which is the actual
fundamental multiplied or divided by an integer factor.

However, for the case of pitched musical tones, such
considerations are irrelevant since the fundamental frequency
is approximately known a priori. Therefore, a simpler
constraint may be invoked to determine appropriate
time shifts. Specifically, denoting the phase terms of the
sinusoids in Equation 56 by $\theta_j^k[n]$ and
$\theta_j^{k+1}[n]$ respectively, where

$$\theta_j^k[n] = j\beta_k\omega_0^k(n + \delta^k) + \frac{\Delta_j^k n}{\rho_k} + \phi_j^k \qquad (92)$$



and denoting the unmodified phase terms from Equation 45
as $\phi_j^k[n]$ and $\phi_j^{k+1}[n]$, a reasonable constraint on
the phase behavior of corresponding components from each
synthetic contribution is to require that the differential
between the unmodified phase terms at the center of the
unmodified synthesis frame match the differential between the
modified phase terms at the modified frame center. Formally,
this requirement is given by

$$\theta_j^{k+1}[-\rho_k N_s/2] - \theta_j^k[\rho_k N_s/2] = \phi_j^{k+1}[-N_s/2] - \phi_j^k[N_s/2], \quad \text{for all } j. \qquad (93)$$

Solving this equation for $\delta^{k+1}$ using the phase functions
just defined yields the recursion

$$\delta^{k+1} = \frac{\beta_k\omega_0^k}{\beta_{k+1}\omega_0^{k+1}}\left(\delta^k + (\rho_k - 1/\beta_k)N_s/2\right) + (\rho_k - 1/\beta_{k+1})N_s/2. \qquad (94)$$



Note that there is no dependence on j in this recursion,
verifying that $\delta^{k+1}$ is a global time shift that needs to
be calculated only once per frame. Furthermore, there is no
dependence on the pitch onset time estimate $\tau_p^k$ as in DFT
assignment block 1704; therefore, pitch onset time estimation
as in speech synthesizer 1700 is not required for music
synthesizer 2800. All other algorithmic components of
music synthesizer 2800 and their structural relationships are
identical to those of speech synthesizer 1700. As in speech
synthesizer 1700, if only time-scale modification is desired,
data path 2825 may be omitted, and the modification
algorithms described may be implemented with $\beta_k = 1$ for
all k. Likewise, if only frequency-scale modification is desired,
then data path 2826 may be omitted, and the modification
algorithms described may be implemented with $\rho_k = 1$ for
all k.
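
The phase-matching constraint for musical tones can be verified numerically. The sketch below assumes the unmodified phase of component j is $(j\omega_0 + \Delta_j)n + \phi_j$ and the modified phase is $j\beta\omega_0(n + \delta) + \Delta_j n/\rho + \phi_j$, with both contributions in a frame sharing that frame's time-scale factor. Under those assumptions it solves the center-of-frame constraint for the next time shift; the closed form and all names are illustrative, not a transcription of the specification.

```python
def unmodified_phase(n, j, omega0, diff_freq, phase):
    """Assumed phase of component j of an unmodified quasi-harmonic contribution."""
    return (j * omega0 + diff_freq) * n + phase

def modified_phase(n, j, beta, omega0, delta, diff_freq, rho, phase):
    """Assumed phase of component j after time- and frequency-scale modification."""
    return j * beta * omega0 * (n + delta) + diff_freq * n / rho + phase

def next_shift(delta_k, n_s, rho_k, beta_k, beta_k1, omega0_k, omega0_k1):
    """Time shift of contribution k+1 that matches the phase differentials at frame center."""
    ratio = (beta_k * omega0_k) / (beta_k1 * omega0_k1)
    return (ratio * (delta_k + (rho_k - 1 / beta_k) * n_s / 2)
            + (rho_k - 1 / beta_k1) * n_s / 2)

def constraint_error(j, delta_k, delta_k1, n_s, rho_k, beta_k, beta_k1,
                     omega0_k, omega0_k1, diff_k, diff_k1, phase_k, phase_k1):
    """Modified minus unmodified phase differential at the frame center for component j."""
    mod = (modified_phase(-rho_k * n_s / 2, j, beta_k1, omega0_k1, delta_k1, diff_k1, rho_k, phase_k1)
           - modified_phase(rho_k * n_s / 2, j, beta_k, omega0_k, delta_k, diff_k, rho_k, phase_k))
    ref = (unmodified_phase(-n_s / 2, j, omega0_k1, diff_k1, phase_k1)
           - unmodified_phase(n_s / 2, j, omega0_k, diff_k, phase_k))
    return mod - ref
```

The same shift satisfies the constraint for every component index, which is consistent with the observation that the time shift needs to be computed only once per frame.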

A second version of a music synthesizer capable of
performing time- and pitch-scale modification on previously
analyzed musical tone signals is illustrated in FIG. 29.
Music synthesizer 2900 operates identically to speech
synthesizer 2500, with the exception that the time shift
parameters used in modification synthesis are calculated
according to Equation 94. As in speech synthesizer 2500, data
path 2921 (which is used to transmit time-scale modification
factor $\rho_k$) may be omitted if only pitch-scale modification
is desired, and modification may be implemented with
$\rho_k = 1$ for all k.

The architecture of a possible implementation of an audio
analysis/synthesis system using a general-purpose digital
signal processing microprocessor is illustrated in FIG. 30. It
should be noted that this implementation is only one of many
alternative embodiments that will be readily apparent to
those skilled in the art. For example, certain subgroups of the
algorithmic components of the various systems may be
implemented in parallel using application-specific ICs
(ASICs), field-programmable gate arrays (FPGAs),
standard ICs, or discrete components.
What is claimed:

1. A method of synthesizing artifact-free modified speech 
signals from a parameter set and a sequence of frequency- 
scale modification factors, 

the parameter set comprising a sequence of coefficient
sets representative of a sequence of estimates of the
frequency response of a human vocal tract, a corresponding
sequence of estimates of a fundamental frequency,
and a corresponding sequence of quasi-harmonic
sinusoidal model parameter sets;

each one of the estimates of a fundamental frequency and 
the corresponding quasi-harmonic sinusoidal model 
parameter set comprising a representation of one of a 
sequence of overlapping speech data frames; 

the method comprising the steps of:

(a) estimating, with a pitch onset time estimator responsive
to the sequence of coefficient sets, the sequence
of estimates of a fundamental frequency, and the
sequence of quasi-harmonic sinusoidal model
parameter sets, a sequence of excitation times relative
to the centers of each one of the corresponding
overlapping speech data frames in the sequence of 
speech data frames at which an excitation pulse 
occurs; 

(b) generating a frequency-domain sequence of data
frames from a discrete Fourier transform assignment
means responsive to the sequence of excitation
times, the corresponding sequence of quasi-harmonic
sinusoidal model parameter sets, the sequence
of frequency-scale modification factors, and the
sequence of estimates of a fundamental frequency;

(c) transforming the frequency-domain sequence of
data frames with an inverse discrete Fourier transform
means to produce a time-domain sequence of
data frames;

(d) generating a contiguous sequence of speech data 
representative of the modified speech signal from an 
overlap-add means responsive to the time-domain 
sequence of data frames; and 

(e) converting the contiguous sequence of speech data
into an analog signal using a digital-to-analog converter
means to produce the modified speech signal.

2. The method of claim 1 wherein the parameter set 
further comprises an envelope stream representative of 
time-varying average magnitude, the sequence of overlapping
speech data frames is further represented by the envelope
stream, and

the overlap-add means is additionally responsive to the 
envelope stream. 

3. A method of synthesizing artifact-free modified speech
signals from a parameter set and a sequence of time-scale 
modification factors, 

the parameter set comprising a sequence of coefficient 
sets representative of a sequence of estimates of the
frequency response of a human vocal tract, a corre- 
sponding sequence of estimates of a fundamental fre- 
quency, and a corresponding sequence of quasi-har- 
monic sinusoidal model parameter sets; 

each one of the estimates of a fundamental frequency and
the corresponding quasi-harmonic sinusoidal model 
parameter set comprising a representation of one of a 
sequence of overlapping speech data frames; 

the method comprising the steps of: 
(a) estimating, with a pitch onset time estimator responsive
to the sequence of coefficient sets, the sequence
of estimates of a fundamental frequency, and the
sequence of quasi-harmonic sinusoidal model
parameter sets, a sequence of excitation times rela- 
tive to the centers of each one of the corresponding 
overlapping speech data frames in the sequence of 
speech data frames at which an excitation pulse 
occurs; 

(b) generating a frequency-domain sequence of data 
frames from a discrete Fourier transform assignment 
means responsive to the sequence of excitation 
times, the corresponding sequence of quasi-har- 
monic sinusoidal model parameter sets, the sequence 
of estimates of a fundamental frequency, and the 
sequence of time-scale modification factors; 

(c) transforming the frequency-domain sequence of 
data frames with an inverse discrete Fourier trans- 
form means to produce a time-domain sequence of 
data frames; 

(d) generating a contiguous sequence of speech data 
representative of the modified speech signal from an 
overlap-add means responsive to the time-domain 
sequence of data frames and the sequence of time- 
scale modification factors; and 

(e) converting the contiguous sequence of speech data 
into an analog signal using a digital-to-analog converter
means to produce the modified speech signal.

4. The method of claim 3 wherein the parameter set 
further comprises an envelope stream representative of 
time-varying average magnitude, the sequence of overlap- 
ping speech data frames is further represented by the enve- 
lope stream, and 

the overlap-add means is additionally responsive to the 
envelope stream. 

5. A method of synthesizing artifact-free modified speech 
signals from a parameter set and a sequence of time-scale 
modification factors, 

the parameter set comprising a sequence of coefficient 
sets representative of a sequence of estimates of the 
frequency response of a human vocal tract, a corre- 
sponding sequence of estimates of a fundamental fre- 
quency, and a corresponding sequence of unmodified 
quasi-harmonic sinusoidal model parameter sets; 

each one of the estimates of a fundamental frequency and 
the corresponding quasi-harmonic sinusoidal model 
parameter set comprising a representation of one of a 
sequence of overlapping speech data frames; 

the method comprising the steps of: 

(a) estimating, with a pitch onset time estimator respon- 
sive to the sequence of coefficient sets, the sequence 
of estimates of a fundamental frequency, and the 
sequence of unmodified quasi-harmonic sinusoidal 
model parameter sets, a sequence of excitation times 
relative to the centers of each one of the correspond- 
ing overlapping speech data frames in the sequence 
of speech data frames at which an excitation pulse
occurs; 

(b) generating a sequence of modified quasi-harmonic 
sinusoidal model parameter sets with a phasor inter- 
polator responsive to the sequence of excitation 
times, the sequence of pitch-scale modification fac- 
tors, the sequence of estimates of the fundamental 
frequency, the sequence of coefficient sets, and the 
sequence of unmodified quasi-harmonic sinusoidal 
model parameter sets, each of the modified quasi- 
harmonic sinusoidal model parameter sets compris- 
ing a set of modified amplitudes, a corresponding set 
of modified frequencies, and a corresponding set of 
modified phases; 
(c) generating a frequency-domain sequence of data
frames from a discrete Fourier transform assignment
means responsive to the sequence of excitation
times, the corresponding sequence of modified
quasi-harmonic sinusoidal model parameter sets, the 
sequence of pitch-scale modification factors, and the 
sequence of estimates of a fundamental frequency; 

(d) transforming the frequency-domain sequence of 
data frames with an inverse discrete Fourier trans- 
form means to produce a time-domain sequence of 
data frames; 

(e) generating a contiguous sequence of speech data 
representative of the modified speech signal from an 
overlap-add means responsive to the time-domain 
sequence of data frames; and 

(f) converting the contiguous sequence of speech data 
into an analog signal using a digital-to-analog con- 
verter means to produce the modified speech signal. 

6. The method of claim 5 wherein the parameter set 
further comprises an envelope stream representative of
time-varying average magnitude, the sequence of overlapping
speech data frames is further represented by the envelope
stream, and

the overlap-add means is additionally responsive to the 
envelope stream. 

7. A method of synthesizing artifact-free modified musical 
tone signals from a parameter set and a sequence of fre- 
quency-scale modification factors; 

the parameter set comprising a sequence of fundamental 
frequency estimates and a sequence of quasi-harmonic 
sinusoidal model parameter sets; 

the method comprising the steps of: 

(a) generating a frequency-domain sequence of data 
frames from a discrete Fourier transform assignment 
means responsive to the sequence of fundamental 
frequency estimates, the corresponding sequence of 
quasi-harmonic sinusoidal model parameter sets, and 
the sequence of frequency-scale modification fac- 
tors; 

(b) transforming the frequency-domain sequence of 
data frames with an inverse discrete Fourier trans- 
form means to produce a time-domain sequence of 
data frames; 

(c) generating a contiguous sequence of music data
representative of the modified musical tone signals 
from an overlap-add means responsive to the time- 
domain sequence of data frames; and 

(d) generating the contiguous sequence of music data 
into an analog signal using a digital-to-analog con- 
verter means to produce the modified musical tone 
signal. 

8. The method of claim 7 wherein the parameter set 
further comprises an envelope stream representative of 
time-varying average magnitude, and the overlap-add means
is additionally responsive to the envelope stream. 

9. A method of synthesizing artifact-free modified musical 
tone signals from a parameter set and a sequence of time- 
scale modification factors; 

the parameter set comprising a sequence of fundamental 
frequency estimates and a sequence of quasi-harmonic 
sinusoidal model parameter sets; 
the method comprising the steps of: 
(a) generating a frequency-domain sequence of data 
frames from a discrete Fourier transform assignment 
means responsive to the sequence of fundamental 
frequency estimates, the corresponding sequence of 
quasi-harmonic sinusoidal model parameter sets, and
the sequence of time-scale modification factors; 

(b) transforming the frequency-domain sequence of 
data frames with an inverse discrete Fourier trans- 
form means to produce a time-domain sequence of 
data frames; 

(c) generating a contiguous sequence of music data 
representative of the modified musical tone signals 
from an overlap-add means responsive to the time- 
domain sequence of data frames and the sequence of 
time-scale modification factors; and 

(d) converting the contiguous sequence of music data 
into an analog signal using a digital-to-analog con- 
verter means to produce the modified musical tone 
signal. 

10. The method of claim 9 wherein the parameter set 
further comprises an envelope stream representative of 
time-varying average magnitude, and the overlap-add means
is additionally responsive to the envelope stream. 

11. A method of synthesizing artifact-free modified musical
tone signals from a parameter set and a sequence of
pitch-scale modification factors;
the parameter set comprising a sequence of coefficient
sets representative of a sequence of estimates of a 
spectral envelope, a corresponding sequence of esti- 
mates of a fundamental frequency, and a corresponding 
sequence of unmodified quasi-harmonic sinusoidal 
model parameter sets; 
each one of the estimates of a fundamental frequency and 
the corresponding quasi-harmonic sinusoidal model 
parameter set comprising a representation of one of a 
sequence of overlapping musical tone data frames; 
the method comprising the steps of: 

(a) estimating, with a pitch onset time estimator respon- 
sive to the sequence of coefficient sets, the sequence 
of estimates of a fundamental frequency, and the 
sequence of unmodified quasi-harmonic sinusoidal 
model parameter sets, a sequence of excitation times 
relative to the centers of each one of the correspond- 
ing musical tone data frames in the sequence of 
speech data frames at which an excitation pulse 
occurs; 

(b) generating a sequence of modified quasi-harmonic 
sinusoidal model parameter sets with a phasor inter- 
polator responsive to the sequence of excitation 
times, the sequence of pitch-scale modification fac- 
tors, the sequence of estimates of the fundamental 
frequency, the sequence of coefficient sets and the 
sequence of unmodified quasi-harmonic sinusoidal 
model parameter sets; 

(c) generating a frequency-domain sequence of data 
frames from a discrete Fourier transform assignment 
means responsive to the sequence of modified quasi- 
harmonic sinusoidal model parameter sets, the 
sequence of pitch-scale modification factors, and the 
sequence of estimates of a fundamental frequency; 

(d) transforming the frequency-domain sequence of 
data frames with an inverse discrete Fourier trans- 
form means to produce a time-domain sequence of 
data frames; 

(e) generating a contiguous sequence of musical data 
representative of the modified musical tone signal 
from an overlap-adder responsive to the time-do- 
main sequence of data frames; and 

(f) converting the contiguous sequence of speech data 
into an analog signal using a digital-to-analog con- 
verter means to produce the modified tone signal. 
12. The method of claim 11 wherein the parameter set
further comprises an envelope stream representative of
time-varying average magnitude, and the overlap-adder is
additionally responsive to the envelope stream.

13. An apparatus for generating a signal representative of
a synthetic speech waveform from a set of parameters 
representative of overlapping speech data frames stored in a 
memory means, and a sequence of frequency scale modifi- 
cation factors; 

the set of parameters comprising a sequence of quasi-harmonic
sinusoidal model parameter sets, a sequence
of coefficient sets representative of a frequency
response of a human vocal tract, and a sequence of
fundamental frequency estimates,

the apparatus comprising:

(a) a pitch onset time estimator means electrically 
coupled to the memory means and responsive to the 
sequence of coefficient sets, the sequence of funda- 
mental frequency estimates, and the sequence of 
quasi-harmonic sinusoidal model parameter sets for
generating a first signal representative of a sequence
of excitation times relative to the center of each of
the corresponding speech data frames at which an
excitation pulse occurs;

(b) a discrete Fourier transform assignment means
electrically coupled to the memory means and 
responsive to the sequence of fundamental frequency 
estimates, the sequence of quasi-harmonic sinusoidal 
model parameter sets, the first signal, and the 
sequence of frequency-scale modification factors for
producing a second signal from which a modified
synthetic contribution may be generated using a
discrete Fourier transform algorithm;

(c) a discrete Fourier transform means responsive to the
second signal for generating a transformed signal;
and 

(d) an overlap-add means responsive to the transformed 
signal for generating the signal representative of the 
synthetic speech waveform. 

14. The apparatus of claim 13, wherein the speech information
further comprises an envelope stream representative
of time-varying average magnitude, and the overlap-add
means is electrically coupled to the memory means and is
additionally responsive to the envelope stream.

15. An apparatus for generating a signal representative of
a synthetic speech waveform from a set of parameters 
representative of overlapping speech data frames stored in a 
memory means and a sequence of time-scale modification 
factors, 

the set of parameters comprising a sequence of quasi- 50 
harmonic sinusoidal model parameter sets, a sequence 
of coefficient sets representative of a frequency 
response of a human vocal tract, and a sequence of 
fundamental frequency estimates, 

the apparatus comprising: 

(a) a pitch onset time estimator means electrically 
coupled to the memory means and responsive to the 
sequence of coefficient sets, the sequence of funda- 
mental frequency estimates, and the sequence of 
quasi-harmonic sinusoidal model parameter sets for 
generating a first signal representative of a sequence 
of excitation times relative to the center of each of 
the corresponding speech data frames at which an 
excitation pulse occurs; 

(b) a discrete Fourier transform assignment means 
electrically coupled to the memory means and 
responsive to the sequence of fundamental frequency 
estimates, the sequence of quasi-harmonic sinusoidal 
model parameter sets, the first signal, and the 
sequence of time-scale modification factors for pro- 
ducing a second signal from which a modified syn- 
thetic contribution may be generated using a discrete 
Fourier transform algorithm; 

(c) a discrete Fourier transform means responsive to the 
second signal for generating a transformed signal; 
and 

(d) an overlap-add means responsive to the transformed 
signal and the sequence of time-scale modification 
factors for generating the signal representative of the 
synthetic speech waveform. 

16. The apparatus of claim 15, wherein the speech infor- 
mation further comprises an envelope stream representative 
of time-varying average magnitude, and the overlap-add 
means is electrically coupled to the memory means and is 
additionally responsive to the envelope stream. 

17. An apparatus for generating a synthetic speech wave- 
form from a set of parameters representative of overlapping 
speech data frames stored in a memory means and a 
sequence of pitch-scale modification factors; 

the speech information comprising a sequence of quasi- 
harmonic sinusoidal model parameter sets, a sequence 
of coefficient sets representative of a frequency 
response of a human vocal tract, and a sequence of 
fundamental frequency estimates, 

the apparatus comprising: 

(a) a pitch onset time estimator means electrically 
coupled to the memory means and responsive to the 
sequence of coefficient sets, the sequence of funda- 
mental frequency estimates, and the sequence of 
quasi-harmonic sinusoidal model parameter sets for 
generating a first signal representative of a sequence 
of time estimates relative to the center of each of the 
frames at which an excitation pulse occurs; 

(b) a phasor interpolator means electrically coupled to 
the memory means and the pitch onset time estimator 
means and responsive to the sequence of coefficient 
sets, the sequence of fundamental frequency esti- 
mates, the sequence of quasi-harmonic sinusoidal 
model parameter sets, the first signal, and the 
sequence of pitch-scale modification factors for gen- 
erating a sequence of modified quasi-harmonic sinu- 
soidal model parameter sets; 

(c) a discrete Fourier transform assignment means 
electrically coupled to the phasor interpolator means 
and the pitch onset time estimator means and respon- 
sive to the sequence of fundamental frequency esti- 
mates, the sequence of modified quasi-harmonic 
sinusoidal model parameter sets, the first signal and 
the sequence of pitch-scale modification factors for 
producing a second signal from which a modified 
synthetic contribution may be generated using a 
discrete Fourier transform algorithm; 

(d) a discrete Fourier transform means responsive to the 
second signal for generating a transformed signal; 
and 

(e) an overlap-add means responsive to the transformed 
signal for generating the signal representative of the 
synthetic speech waveform. 

18. The apparatus of claim 17, wherein the speech infor- 
mation further comprises an envelope stream representative 
of time-varying average magnitude, and the overlap-add 
means is electrically coupled to the memory means and is 
additionally responsive to the envelope stream. 

19. An apparatus for generating a signal representative of 
a synthetic musical waveform from a set of parameters 
representative of overlapping musical tone data frames 
stored in a memory means and a sequence of frequency-scale 
modification factors; 

the parameter set comprising a sequence of quasi-harmonic 
sinusoidal model parameter sets and a sequence 
of fundamental frequency estimates, 

the apparatus comprising: 

(a) a discrete Fourier transform assignment means 
electrically coupled to the memory means and 
responsive to the sequence of fundamental frequency 
estimates, the sequence of quasi-harmonic sinusoidal 
model parameter sets, and the sequence of fre- 
quency-scale modification factors for producing a 
first signal from which a modified synthetic contri- 
bution may be generated using a discrete Fourier 
transform algorithm; 

(b) a discrete Fourier transform means responsive to the 
first signal for generating a transformed signal; and 

(c) an overlap-add means responsive to the transformed 
signal for generating the signal representative of the 
synthetic musical waveform. 

20. The apparatus of claim 19 wherein the musical 
information further comprises an envelope stream represen- 
tative of time-varying average magnitude, and the overlap- 
add means is electrically coupled to the memory means and 
is additionally responsive to the envelope stream. 
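
The chain recited in claims 19 and 20 (DFT assignment, inverse DFT, overlap-add) can be illustrated with a minimal Python sketch. It is not the claimed implementation. Everything named here is an assumption made for illustration: the function names, the symbol `beta` for the frequency-scale modification factor, the nearest-bin assignment rule, the triangular window, the frame length of twice the hop, and phases referenced to the start of each frame rather than its center. The envelope stream of claim 20 is omitted.

```python
import cmath
import math


def assign_dft_bins(amps, phases, f0, fs, n_dft, beta=1.0):
    """Map harmonic k of f0, scaled by beta, to the nearest DFT bin (assumed rule)."""
    spectrum = [0j] * n_dft
    for k, (amp, phi) in enumerate(zip(amps, phases), start=1):
        b = round(beta * k * f0 * n_dft / fs)
        if 0 < b < n_dft // 2:  # drop DC and anything at or above Nyquist
            c = 0.5 * amp * cmath.exp(1j * phi)
            spectrum[b] += c
            spectrum[n_dft - b] += c.conjugate()  # symmetry gives a real frame
    return spectrum


def inverse_dft(spectrum):
    """Direct O(N^2) inverse DFT with no 1/N factor; an FFT would replace this."""
    n = len(spectrum)
    return [
        sum(x * cmath.exp(2j * math.pi * b * t / n) for b, x in enumerate(spectrum)).real
        for t in range(n)
    ]


def overlap_add(frames, hop):
    """Sum frames of length 2*hop under a triangular window spaced hop apart."""
    out = [0.0] * (hop * (len(frames) - 1) + 2 * hop)
    for i, frame in enumerate(frames):
        for t, v in enumerate(frame):
            out[i * hop + t] += (1.0 - abs(t - hop) / hop) * v
    return out


def synthesize(param_frames, f0s, betas, fs=8000, hop=32):
    """Run assignment, inverse DFT and overlap-add over a sequence of frames."""
    n_dft = 2 * hop
    frames = [
        inverse_dft(assign_dft_bins(a, p, f0, fs, n_dft, beta))
        for (a, p), f0, beta in zip(param_frames, f0s, betas)
    ]
    return overlap_add(frames, hop)


# Three identical frames, one unit-amplitude harmonic at 500 Hz, shifted up by beta = 2.
out = synthesize([([1.0], [0.0])] * 3, [500.0] * 3, [2.0] * 3)
```

In this example the scaled component falls exactly on a DFT bin, so the region covered by two overlapping frames reproduces a steady 1000 Hz cosine. For components that fall between bins, the nearest-bin rule assumed here quantizes the frequency.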

21. An apparatus for generating a signal representative of 
a synthetic musical waveform from a set of parameters 
representative of overlapping musical tone data frames 
stored in a memory means and a sequence of time-scale 
modification factors; 

the parameter set comprising a sequence of quasi-har- 
monic sinusoidal model parameter sets and a sequence 
of fundamental frequency estimates, 

the apparatus comprising: 

(a) a discrete Fourier transform assignment means 
electrically coupled to the memory means and 
responsive to the sequence of fundamental frequency 
estimates, the sequence of quasi-harmonic sinusoidal 
model parameter sets, and the sequence of fre- 
quency-scale modification factors for producing a 
first signal from which a modified synthetic contribution 
may be generated using a discrete Fourier 
transform algorithm; 

(b) a discrete Fourier transform means responsive to the 
first signal for generating a transformed signal; and 

(c) an overlap-add means responsive to the transformed 
signal and the sequence of time-scale modification 
factors for generating the signal representative of the 
synthetic musical waveform. 

22. The apparatus of claim 21 wherein the musical 
information further comprises an envelope stream representative 
of time-varying average magnitude, and the overlap-add 
means is electrically coupled to the memory means and 
is additionally responsive to the envelope stream. 

23. An apparatus for generating a signal representative of 
a synthetic musical tone waveform from a set of parameters 
representative of overlapping frames of musical data stored 
in a memory means and a sequence of pitch-scale modifi- 
cation factors; 

the musical information comprising a sequence of quasi- 
harmonic sinusoidal model parameter sets, a sequence 
of coefficient sets representative of estimates of a 
spectral envelope, and a sequence of fundamental fre- 
quency estimates, 

the apparatus comprising: 

(a) a pitch onset time estimator means electrically 
coupled to the memory means and responsive to the 
sequence of coefficient sets, the sequence of funda- 
mental frequency estimates, and the sequence of 
quasi-harmonic sinusoidal model parameter sets for 
generating a first signal representative of a sequence 
of time estimates relative to the center of each of the 
frames at which an excitation pulse occurs; 

(b) a phasor interpolator means electrically coupled to 
the memory means and the pitch onset time estimator 
means and responsive to the sequence of coefficient 
sets, the sequence of fundamental frequency esti- 
mates, the sequence of quasi-harmonic sinusoidal 
model parameter sets, the first signal, and the 
sequence of pitch-scale modification factors for gen- 
erating a sequence of modified quasi-harmonic sinu- 
soidal model parameter sets; 

(c) a discrete Fourier transform assignment means 
electrically coupled to the phasor interpolator means 
and responsive to the sequence of fundamental fre- 
quency estimates, the sequence of modified quasi- 
harmonic sinusoidal model parameter sets, and the 
sequence of pitch-scale modification factors for pro- 
ducing a second signal from which a modified syn- 
thetic contribution may be generated using a discrete 
Fourier transform algorithm; 

(d) a discrete Fourier transform means responsive to the 
second signal for generating a transformed signal; 
and 

(e) an overlap-add means responsive to the transformed 
signal for generating the signal representative of the 
synthetic musical tone waveform. 

24. The apparatus of claim 23, wherein the musical 
information further comprises an envelope stream represen- 
tative of time-varying average magnitude, and the overlap- 
add means is electrically coupled to the memory means and 
is additionally responsive to the envelope stream. 